[00:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191210T0000).
[00:00:04] kart_: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:11] Hmm, no kartik for SWAT…
[00:00:35] Anyone else from the Language team around? Is this urgent or can we wait for someone to be present for confirmation?
[00:01:05] * James_F doesn't know anything about CX Campaigns.
[00:08:04] OK, will mark it as undeployed.
[00:10:10] SWAT closed.
[00:12:43] RECOVERY - Prometheus prometheus2004/ops restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ops
[00:23:53] James_F, awight - thanks for all the assistance today.
[00:24:16] brennen: No problem. I'm now manually fixing a bunch of broken pages as a way of making the errors go away. ;-)
[00:24:41] lol
[00:24:47] Is there a lot of broken pages?
[00:24:47] MediaWiki is a little too kind on the GIGO issues.
[00:25:08] Reedy: Yes, and most of them are broken by the catastrophe that is the sfn template, so can't be fixed.
[00:25:15] ?action=delete
[00:25:37] Well, yes, but if you think SuperProtect went poorly… ;-)
[00:25:53] Edits like https://en.wikipedia.org/w/index.php?title=Rohingya_genocide&diff=prev&oldid=930062640&diffmode=source are pretty easy fixes, though.
[00:26:27] Broken pages end up in https://en.wikipedia.org/wiki/Category:Pages_with_reference_errors
[00:26:52] >The following 200 pages are in this category, out of 5,448 total.
[00:26:55] sadface
[00:27:09] Yeah, as I said, GIGO.
[00:28:10] I wonder if any of the automated tools...
[00:28:20] /semi-automated
[00:30:54] where's wiki blame again? ;)
[00:32:10] * James_F points at anyone else. ;-)
[00:34:55] James_F: omg you took the bait! Thank you for your sacrifice.
[00:35:02] awight: :-P
[00:35:18] awight: there's plenty for everyone ;)
[00:35:53] People really should only be allowed to edit with VE. It fixes all these Cite issues before they occur.
[00:36:23] Tell that to the Translate extension...
[00:36:43] Real content isn't written with Translate.
[00:36:53] CX works beautifully with VE.
[00:37:23] {Cite web
[00:37:26] And that's not even the broken one
[00:37:52] Why'd that be broken?
[00:38:14] Maybe the name of the reference just happens to mostly look like wikitext? ;-)
[00:38:26] Don't assume! You just hate the community! Etc.
[00:38:58] (CR) Andrew Bogott: [C: +1] "this is now unblocked!" [mediawiki-config] - https://gerrit.wikimedia.org/r/549936 (https://phabricator.wikimedia.org/T161553) (owner: Andrew Bogott)
[00:41:14] Looks like someone's written a bad script/gadget that's going around renaming references to "autogenerated1" etc. and leaving other uses behind.
[00:41:16] * James_F sighs.
[00:43:53] lol, a ref a-z aa and ab, no text provided
[00:44:10] Operations, DNS, Traffic: Add wikiworkshop.org to the Foundation's DNS - https://phabricator.wikimedia.org/T240303 (leila)
[00:44:20] Often they were provided and someone vandalised/mis-edited to remove it.
[00:45:16] This is a good one...
[00:45:19] [[Category:Pages with reference errors]] added manually
[00:45:24] No reference errors on the page
[00:45:29] Ha.
[00:45:34] Gotta love our users.
[00:51:40] It's slightly a shame it doesn't use slightly more specific category names
[00:53:13] It does for some errors.
[00:54:10] https://en.wikipedia.org/wiki/Category:Pages_with_broken_reference_names https://en.wikipedia.org/wiki/Category:Pages_with_incorrect_ref_formatting etc.
[00:56:57] Any of those the ones we care about?
[00:59:23] I think the spammy line is particularly triggered by re-definition of a reference, specifically.
[01:01:32] tuned out for a bit there - anything specific i should do to help?
[01:01:49] (Abandoned) Andrew Bogott: puppet-merge: add some conftool extras [puppet] - https://gerrit.wikimedia.org/r/413745 (https://phabricator.wikimedia.org/T157133) (owner: Andrew Bogott)
[01:10:12] Could almost do with a simplified logstash board for this
[01:10:19] literally server and pagename that match
[01:37:46] Operations, DNS, Traffic: Add wikiworkshop.org to the Foundation's DNS - https://phabricator.wikimedia.org/T240303 (Reedy) Are there any subdomains etc? Or does the domain only need to point at `171.64.75.80`?
[02:40:35] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1078.65 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[03:17:21] Operations, DNS, Traffic: Add wikiworkshop.org to the Foundation's DNS - https://phabricator.wikimedia.org/T240303 (BBlack) I'm assuming that, for now, the hosting of the web service (and email?) is not moving, just the whois ownership and DNS service? We usually need a fair bit more information tha...
[04:45:23] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[05:04:01] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:06:44] !log Remove triggers from db2095:3314 for ar_comment - T234704
[06:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:51] T234704: Remove ar_comment from sanitarium triggers - https://phabricator.wikimedia.org/T234704
[06:08:50] !log Deploy schema change on s4 codfw master (this will generate lag on s4 codfw) T233135
[06:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:55] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135
[06:39:14] !log Remove db1062 from tendril and zarcillo T239188
[06:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:39:19] T239188: Decommission db1062.eqiad.wmnet - https://phabricator.wikimedia.org/T239188
[06:42:47] (PS1) Marostegui: mariadb: db1062 set it to spare [puppet] - https://gerrit.wikimedia.org/r/556101 (https://phabricator.wikimedia.org/T239188)
[06:44:12] (CR) Marostegui: [C: +2] mariadb: db1062 set it to spare [puppet] - https://gerrit.wikimedia.org/r/556101 (https://phabricator.wikimedia.org/T239188) (owner: Marostegui)
[06:48:17] Operations, DBA, Growth-Team, StructuredDiscussions, WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (Marostegui) The new hosts for es4 and es5 have been ordered and will be most likely set up (installed and racked...
[07:44:49] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:01:22] (PS1) Phedenskog: icinga: Add all WebPageReplay alerts. [puppet] - https://gerrit.wikimedia.org/r/556108 (https://phabricator.wikimedia.org/T198287)
[08:01:58] (CR) jerkins-bot: [V: -1] icinga: Add all WebPageReplay alerts.
[puppet] - https://gerrit.wikimedia.org/r/556108 (https://phabricator.wikimedia.org/T198287) (owner: Phedenskog)
[08:02:29] (CR) Muehlenhoff: [C: +1] "LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/555761 (https://phabricator.wikimedia.org/T226444) (owner: Herron)
[08:04:15] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:04:52] (PS2) Phedenskog: icinga: Add all WebPageReplay alerts. [puppet] - https://gerrit.wikimedia.org/r/556108 (https://phabricator.wikimedia.org/T198287)
[08:07:47] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:08:12] Operations, Wikimedia-Logstash: Upgrade ELK Stack - https://phabricator.wikimedia.org/T234854 (elukey) @herron hello :) Any comment on what I wrote above about cronspam?
[08:18:52] (PS3) Giuseppe Lavagetto: blubberoid: break TLS functionality into a helper [deployment-charts] - https://gerrit.wikimedia.org/r/554832 (https://phabricator.wikimedia.org/T235411)
[08:19:18] (CR) Giuseppe Lavagetto: [C: +2] "I'll fix the remaining inconsistency in the followup patch" [deployment-charts] - https://gerrit.wikimedia.org/r/554832 (https://phabricator.wikimedia.org/T235411) (owner: Giuseppe Lavagetto)
[08:19:32] (Merged) jenkins-bot: blubberoid: break TLS functionality into a helper [deployment-charts] - https://gerrit.wikimedia.org/r/554832 (https://phabricator.wikimedia.org/T235411) (owner: Giuseppe Lavagetto)
[08:27:52] (PS5) Effie Mouzeli: mediawiki: Check APCu fragmentation in php-check-and-restart.sh [puppet] - https://gerrit.wikimedia.org/r/555950 (https://phabricator.wikimedia.org/T240205)
[08:30:24] (PS5) Muehlenhoff: Setup apt pinning for puppet 5 / facter 3 on stretch/jessie [puppet] - https://gerrit.wikimedia.org/r/554549 (https://phabricator.wikimedia.org/T239832)
[08:30:34] (CR) Effie Mouzeli: [C: +2] mediawiki: Check APCu fragmentation in php-check-and-restart.sh [puppet] - https://gerrit.wikimedia.org/r/555950 (https://phabricator.wikimedia.org/T240205) (owner: Effie Mouzeli)
[08:41:04] (CR) Muehlenhoff: Setup apt pinning for puppet 5 / facter 3 on stretch/jessie (1 comment) [puppet] - https://gerrit.wikimedia.org/r/554549 (https://phabricator.wikimedia.org/T239832) (owner: Muehlenhoff)
[08:48:27] (PS3) Giuseppe Lavagetto: Create common template helpers directory [deployment-charts] - https://gerrit.wikimedia.org/r/554833 (https://phabricator.wikimedia.org/T235411)
[08:48:43] (CR) Alexandros Kosiaris: [C: -1] Standardizes English dictionaries on hunspell for English in ORES (1 comment) [puppet] - https://gerrit.wikimedia.org/r/556023 (https://phabricator.wikimedia.org/T239942) (owner: Halfak)
[08:50:47] (PS1) Andrew Bogott: nova: add nova-api
middleware to inject a default user_data file [puppet] - https://gerrit.wikimedia.org/r/556135 (https://phabricator.wikimedia.org/T181375)
[08:51:23] (CR) jerkins-bot: [V: -1] nova: add nova-api middleware to inject a default user_data file [puppet] - https://gerrit.wikimedia.org/r/556135 (https://phabricator.wikimedia.org/T181375) (owner: Andrew Bogott)
[08:52:09] (CR) Muehlenhoff: "ceph is packaged in Debian, why are we not using these packages? There might be a valid technical reason, but at least it's not mentioned " [puppet] - https://gerrit.wikimedia.org/r/556000 (https://phabricator.wikimedia.org/T239917) (owner: Jhedden)
[08:53:21] (PS2) Andrew Bogott: nova: add nova-api middleware to inject a default user_data file [puppet] - https://gerrit.wikimedia.org/r/556135 (https://phabricator.wikimedia.org/T181375)
[08:53:56] (CR) jerkins-bot: [V: -1] nova: add nova-api middleware to inject a default user_data file [puppet] - https://gerrit.wikimedia.org/r/556135 (https://phabricator.wikimedia.org/T181375) (owner: Andrew Bogott)
[08:55:18] (CR) Arturo Borrero Gonzalez: aptrepo: add ceph nautilus repo for cloudvps (1 comment) [puppet] - https://gerrit.wikimedia.org/r/556000 (https://phabricator.wikimedia.org/T239917) (owner: Jhedden)
[08:55:54] (PS3) Andrew Bogott: nova: add nova-api middleware to inject a default user_data file [puppet] - https://gerrit.wikimedia.org/r/556135 (https://phabricator.wikimedia.org/T181375)
[08:59:16] (PS1) Effie Mouzeli: mediawiki: Check APCu fragmentation in php-check-and-restart.sh [puppet] - https://gerrit.wikimedia.org/r/556136
[08:59:45] PROBLEM - Check systemd state on wtp1044 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:00:22] (PS2) Effie Mouzeli: mediawiki: Check APCu fragmentation in php-check-and-restart.sh [puppet] - https://gerrit.wikimedia.org/r/556136 (https://phabricator.wikimedia.org/T240205)
[09:00:25] PROBLEM - Check systemd state on mw1301 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:01:43] PROBLEM - Check the last execution of php7.2-fpm_check_restart on wtp1044 is CRITICAL: CRITICAL: Status of the systemd unit php7.2-fpm_check_restart https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:01:51] ^ that is me
[09:01:55] fixing
[09:02:21] (CR) Effie Mouzeli: [C: +2] mediawiki: Check APCu fragmentation in php-check-and-restart.sh [puppet] - https://gerrit.wikimedia.org/r/556136 (https://phabricator.wikimedia.org/T240205) (owner: Effie Mouzeli)
[09:02:56] (PS1) Giuseppe Lavagetto: blubberoid: release new chart version using the common templates directory [deployment-charts] - https://gerrit.wikimedia.org/r/556137 (https://phabricator.wikimedia.org/T235411)
[09:03:59] PROBLEM - Check systemd state on mw1322 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:04:07] PROBLEM - Check systemd state on mw2202 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:06:49] PROBLEM - Check the last execution of php7.2-fpm_check_restart on mw1322 is CRITICAL: CRITICAL: Status of the systemd unit php7.2-fpm_check_restart https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:07:15] PROBLEM - Check the last execution of php7.2-fpm_check_restart on mw1301 is CRITICAL: CRITICAL: Status of the systemd unit php7.2-fpm_check_restart https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:08:19] PROBLEM - Check the last execution of php7.2-fpm_check_restart on mw2202 is CRITICAL: CRITICAL: Status of the systemd unit php7.2-fpm_check_restart https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:08:19] PROBLEM - Check systemd state on mw2180 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:08:44] effie: --^
[09:09:31] elukey: I know :p
[09:09:36] [11:01] | fixing
[09:10:03] ah yes sorry! :)
[09:10:25] hehe
[09:10:29] (PS2) Muehlenhoff: Turn old LDAP replicas into spares [puppet] - https://gerrit.wikimedia.org/r/555990 (https://phabricator.wikimedia.org/T224557)
[09:10:35] PROBLEM - Check systemd state on mw1348 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:11:11] !log Restart MySQL on labsdb1012
[09:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:39] PROBLEM - Check the last execution of php7.2-fpm_check_restart on mw2180 is CRITICAL: CRITICAL: Status of the systemd unit php7.2-fpm_check_restart https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:12:00] ACKNOWLEDGEMENT - Check systemd state on mw1301 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
Effie Mouzeli Fixed, it will go away https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:12:00] ACKNOWLEDGEMENT - Check the last execution of php7.2-fpm_check_restart on mw1301 is CRITICAL: CRITICAL: Status of the systemd unit php7.2-fpm_check_restart Effie Mouzeli Fixed, it will go away https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:12:00] ACKNOWLEDGEMENT - Check systemd state on mw1322 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Effie Mouzeli Fixed, it will go away https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:12:00] ACKNOWLEDGEMENT - Check the last execution of php7.2-fpm_check_restart on mw1322 is CRITICAL: CRITICAL: Status of the systemd unit php7.2-fpm_check_restart Effie Mouzeli Fixed, it will go away https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:12:00] ACKNOWLEDGEMENT - Check systemd state on mw1348 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Effie Mouzeli Fixed, it will go away https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:12:00] ACKNOWLEDGEMENT - Check systemd state on mw2180 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Effie Mouzeli Fixed, it will go away https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:12:00] ACKNOWLEDGEMENT - Check the last execution of php7.2-fpm_check_restart on mw2180 is CRITICAL: CRITICAL: Status of the systemd unit php7.2-fpm_check_restart Effie Mouzeli Fixed, it will go away https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:12:01] ACKNOWLEDGEMENT - Check systemd state on mw2202 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
Effie Mouzeli Fixed, it will go away https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:12:01] ACKNOWLEDGEMENT - Check the last execution of php7.2-fpm_check_restart on mw2202 is CRITICAL: CRITICAL: Status of the systemd unit php7.2-fpm_check_restart Effie Mouzeli Fixed, it will go away https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:12:02] ACKNOWLEDGEMENT - Check systemd state on mw2259 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Effie Mouzeli Fixed, it will go away https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:12:02] ACKNOWLEDGEMENT - Check the last execution of php7.2-fpm_check_restart on mw2259 is CRITICAL: CRITICAL: Status of the systemd unit php7.2-fpm_check_restart Effie Mouzeli Fixed, it will go away https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:12:03] ACKNOWLEDGEMENT - Check systemd state on wtp1044 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
Effie Mouzeli Fixed, it will go away https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:12:03] ACKNOWLEDGEMENT - Check the last execution of php7.2-fpm_check_restart on wtp1044 is CRITICAL: CRITICAL: Status of the systemd unit php7.2-fpm_check_restart Effie Mouzeli Fixed, it will go away https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:13:39] RECOVERY - Check systemd state on mw2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:13:49] (CR) Muehlenhoff: [C: +2] Turn old LDAP replicas into spares [puppet] - https://gerrit.wikimedia.org/r/555990 (https://phabricator.wikimedia.org/T224557) (owner: Muehlenhoff)
[09:13:53] RECOVERY - Check systemd state on wtp1044 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:14:33] RECOVERY - Check systemd state on mw1301 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:14:35] RECOVERY - Check systemd state on mw1322 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:14:45] RECOVERY - Check systemd state on mw2202 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:15:11] PROBLEM - Check the last execution of php7.2-fpm_check_restart on mw1348 is CRITICAL: CRITICAL: Status of the systemd unit php7.2-fpm_check_restart https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:17:35] RECOVERY - Check the last execution of php7.2-fpm_check_restart on mw1322 is OK: OK: Status of the systemd unit php7.2-fpm_check_restart https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:17:39] RECOVERY - Check systemd state on mw1348 is OK: OK - running: The system is fully
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:17:59] RECOVERY - Check the last execution of php7.2-fpm_check_restart on mw1301 is OK: OK: Status of the systemd unit php7.2-fpm_check_restart https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:18:04] sorry for the noise :)
[09:19:05] RECOVERY - Check the last execution of php7.2-fpm_check_restart on mw2202 is OK: OK: Status of the systemd unit php7.2-fpm_check_restart https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:19:53] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[09:20:46] !log Restart mysql on dbstore1003, 1004 and 1005 for upgrade
[09:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:47] RECOVERY - Check the last execution of php7.2-fpm_check_restart on mw1348 is OK: OK: Status of the systemd unit php7.2-fpm_check_restart https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:22:11] (PS1) Ema: varnishmtail: new buckets for varnishttfb, use seconds for sum [puppet] - https://gerrit.wikimedia.org/r/556139 (https://phabricator.wikimedia.org/T240180)
[09:22:25] RECOVERY - Check the last execution of php7.2-fpm_check_restart on mw2180 is OK: OK: Status of the systemd unit php7.2-fpm_check_restart https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:23:13] RECOVERY - Check the last execution of php7.2-fpm_check_restart on wtp1044 is OK: OK: Status of the systemd unit php7.2-fpm_check_restart https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:36:15] (PS1) Ema: ATS: increase log_buffer_size and max_line_size [puppet] - https://gerrit.wikimedia.org/r/556141 (https://phabricator.wikimedia.org/T237608)
[09:36:17] (PS1) Ema: ATS: stop logging BereqURL at the
TLS layer too [puppet] - https://gerrit.wikimedia.org/r/556142 (https://phabricator.wikimedia.org/T237608)
[09:38:19] (CR) Elukey: "Left some comments!" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/556072 (owner: EBernhardson)
[09:38:59] (CR) Elukey: Deploy analytics-search keytab to an-airflow (1 comment) [puppet] - https://gerrit.wikimedia.org/r/556072 (owner: EBernhardson)
[09:39:00] Operations, Puppet, Packaging, User-jbond: Create a resources for installing components - https://phabricator.wikimedia.org/T240324 (jbond) p:Triage→Normal
[09:39:14] (CR) Filippo Giunchedi: [C: +1] varnishmtail: new buckets for varnishttfb, use seconds for sum [puppet] - https://gerrit.wikimedia.org/r/556139 (https://phabricator.wikimedia.org/T240180) (owner: Ema)
[09:40:06] (CR) Jbond: Setup apt pinning for puppet 5 / facter 3 on stretch/jessie (3 comments) [puppet] - https://gerrit.wikimedia.org/r/554549 (https://phabricator.wikimedia.org/T239832) (owner: Muehlenhoff)
[09:42:24] (CR) Ema: [C: +2] varnishmtail: new buckets for varnishttfb, use seconds for sum [puppet] - https://gerrit.wikimedia.org/r/556139 (https://phabricator.wikimedia.org/T240180) (owner: Ema)
[09:43:49] (PS1) Marostegui: db-eqiad.php: Depool db1127 [mediawiki-config] - https://gerrit.wikimedia.org/r/556143 (https://phabricator.wikimedia.org/T183485)
[09:44:35] (CR) Arturo Borrero Gonzalez: [C: +1] toolforge-calico: Set up yaml and config to use calicoctl as a pod [puppet] - https://gerrit.wikimedia.org/r/554969 (https://phabricator.wikimedia.org/T239406) (owner: Bstorm)
[09:45:31] Operations, ops-eqiad, Discovery, Discovery-Search (Current work): Degraded RAID on cloudelastic1002 - https://phabricator.wikimedia.org/T239957 (Gehel)
[09:45:38] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1127 [mediawiki-config] - https://gerrit.wikimedia.org/r/556143 (https://phabricator.wikimedia.org/T183485) (owner:
Marostegui)
[09:46:28] (Merged) jenkins-bot: db-eqiad.php: Depool db1127 [mediawiki-config] - https://gerrit.wikimedia.org/r/556143 (https://phabricator.wikimedia.org/T183485) (owner: Marostegui)
[09:48:01] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1127 for table defragmentation (duration: 00m 59s)
[09:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:37] marostegui: thank you for the github replication task :) it is fixed
[09:49:45] hashar: cool thanks! :)
[09:50:22] great!
[09:51:28] !log Optimize wikishared.cx_corpora on db1127 - T183485
[09:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:35] T183485: Please consider purging/moving the cx_corpora table at x1 - https://phabricator.wikimedia.org/T183485
[09:53:19] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[09:54:31] (CR) Muehlenhoff: Setup apt pinning for puppet 5 / facter 3 on stretch/jessie (3 comments) [puppet] - https://gerrit.wikimedia.org/r/554549 (https://phabricator.wikimedia.org/T239832) (owner: Muehlenhoff)
[09:57:45] (PS1) Jcrespo: Revert "Revert "bacula: Schedule hourly copies of production backups to the offsite pool"" [puppet] - https://gerrit.wikimedia.org/r/556144
[09:59:42] PROBLEM - Check systemd state on mw1298 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:59:48] PROBLEM - Check the last execution of php7.2-fpm_check_restart on mw1298 is CRITICAL: CRITICAL: Status of the systemd unit php7.2-fpm_check_restart https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:59:57] (CR) Jcrespo: [C: +2] Revert "Revert "bacula: Schedule hourly copies of production backups to the offsite pool"" [puppet] - https://gerrit.wikimedia.org/r/556144 (owner: Jcrespo)
[10:02:39] (CR) Volans: "Reply inline" (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/554543 (https://phabricator.wikimedia.org/T233183) (owner: Volans)
[10:04:50] Operations, Puppet, Packaging, User-jbond: Create a resources for installing components - https://phabricator.wikimedia.org/T240324 (MoritzMuehlenhoff) One implementation note: The priority is somewhat dependent on the use case and should probably be abstracted away: Everything which arrives from...
[10:04:57] Operations, Puppet, Packaging, User-jbond: Create a resources for installing components - https://phabricator.wikimedia.org/T240324 (jbond) Tagging https://phabricator.wikimedia.org/T178575 as its related
[10:05:34] RECOVERY - Confd template for /srv/config-master/pybal/codfw/kibana on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd
[10:06:39] !log add new disk to RAID array on cloudelastic1002 - T239957
[10:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:45] T239957: Degraded RAID on cloudelastic1002 - https://phabricator.wikimedia.org/T239957
[10:07:05] (CR) Ema: [C: +2] ATS: increase log_buffer_size and max_line_size [puppet] - https://gerrit.wikimedia.org/r/556141 (https://phabricator.wikimedia.org/T237608) (owner: Ema)
[10:08:38] (CR) Arturo Borrero Gonzalez: "@phamhi please take over this patch, refresh it (see godog comments) and merge when appropriate, thanks!" [puppet] - https://gerrit.wikimedia.org/r/554853 (https://phabricator.wikimedia.org/T224585) (owner: Arturo Borrero Gonzalez)
[10:11:12] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/kibana on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd
[10:11:27] (CR) Jbond: Setup apt pinning for puppet 5 / facter 3 on stretch/jessie (1 comment) [puppet] - https://gerrit.wikimedia.org/r/554549 (https://phabricator.wikimedia.org/T239832) (owner: Muehlenhoff)
[10:12:06] Operations: Extend firewall rules for new corp LDAP replicas - https://phabricator.wikimedia.org/T234047 (MoritzMuehlenhoff) Open→Invalid Closing, this turned out to be not needed in the end.
[10:12:27] Operations, Traffic: Start warning and deprecation process for all legacy TLS - https://phabricator.wikimedia.org/T238038 (TheDJ) Question.
https://wikitech.wikimedia.org/wiki/HTTPS/Browser_Recommendations Windows 7: I know it CAN support TLS 1.2, but I can't figure out if Microsoft released a patch to...
[10:13:52] !log stopping slapd on dubnium/pollux following application of the spare role T224557
[10:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:58] T224557: Migrate ldap/corp replicas to Stretch/Buster - https://phabricator.wikimedia.org/T224557
[10:17:45] (CR) Muehlenhoff: Setup apt pinning for puppet 5 / facter 3 on stretch/jessie (1 comment) [puppet] - https://gerrit.wikimedia.org/r/554549 (https://phabricator.wikimedia.org/T239832) (owner: Muehlenhoff)
[10:19:11] (CR) Alexandros Kosiaris: [C: +1] ganeti: assign ganeti400[123] role::ganeti (1 comment) [puppet] - https://gerrit.wikimedia.org/r/555761 (https://phabricator.wikimedia.org/T226444) (owner: Herron)
[10:22:32] Operations, Traffic: Monitor and plot TTFB as seen by Varnish frontends - https://phabricator.wikimedia.org/T240180 (ema) Open→Resolved a:ema Done: https://grafana.wikimedia.org/d/7-ZqK8-Wz/varnish-frontend-ttfb-comparison?orgId=1&from=now-15m&to=now
[10:35:00] !log Upgrade db1127
[10:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:14] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1127" [mediawiki-config] - https://gerrit.wikimedia.org/r/556157
[10:36:46] !log Optimize wikishared.cx_corpora on db2115 (non compressed table) - T183485
[10:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:52] T183485: Please consider purging/moving the cx_corpora table at x1 - https://phabricator.wikimedia.org/T183485
[10:41:00] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[10:44:20] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1127"
[mediawiki-config] - https://gerrit.wikimedia.org/r/556157 (owner: Marostegui)
[10:45:10] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1127" [mediawiki-config] - https://gerrit.wikimedia.org/r/556157 (owner: Marostegui)
[10:45:50] (CR) Jbond: Setup apt pinning for puppet 5 / facter 3 on stretch/jessie (1 comment) [puppet] - https://gerrit.wikimedia.org/r/554549 (https://phabricator.wikimedia.org/T239832) (owner: Muehlenhoff)
[10:48:03] Operations, DC-Ops, hardware-requests, Continuous-Integration-Infrastructure (phase-out-jessie): Replacement hardware for buster/stretch upgrade of contint1001 and contint2001 - https://phabricator.wikimedia.org/T239880 (hashar) `contint1001` has 4 1TB SSD: ` # lshw -class disk -short H/W path...
[10:48:29] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1127 after table defragmentation (duration: 00m 55s)
[10:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:41] Operations, Puppet, serviceops, User-jbond: Rolling restart of etcd to pick up the renewed CA public certificate. - https://phabricator.wikimedia.org/T237362 (Joe) Good news is we only need to do a rolling restart in eqiad, not in codfw, where we still don't use the ca for peer connections
[10:55:37] <_joe_> !log restarting etcd on conf1004 T237362
[10:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:43] T237362: Rolling restart of etcd to pick up the renewed CA public certificate.
- https://phabricator.wikimedia.org/T237362 [10:58:06] <_joe_> talking about things that don't reconnect cleanly [10:58:10] <_joe_> enter pybal [10:59:10] <_joe_> !log restarting pybal on lvs1016, the the other eqiad pybals, to catch up on etcd restart [10:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:14] PROBLEM - Check systemd state on mw1296 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:01:52] PROBLEM - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 14 connections established with conf1004.eqiad.wmnet:4001 (min=61) https://wikitech.wikimedia.org/wiki/PyBal [11:01:52] PROBLEM - PyBal connections to etcd on lvs1013 is CRITICAL: CRITICAL: 0 connections established with conf1004.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [11:03:34] <_joe_> yeah known [11:04:24] <_joe_> !log restarting pybal on lvs1015, then 1013 and 1014 to pick up the etcd restart [11:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:58] PROBLEM - PyBal connections to etcd on lvs1014 is CRITICAL: CRITICAL: 0 connections established with conf1004.eqiad.wmnet:4001 (min=24) https://wikitech.wikimedia.org/wiki/PyBal [11:06:20] PROBLEM - Check the last execution of php7.2-fpm_check_restart on mw1296 is CRITICAL: CRITICAL: Status of the systemd unit php7.2-fpm_check_restart https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:07:04] <_joe_> effie: ^^ this is troubling [11:07:38] RECOVERY - PyBal connections to etcd on lvs1015 is OK: OK: 61 connections established with conf1004.eqiad.wmnet:4001 (min=61) https://wikitech.wikimedia.org/wiki/PyBal [11:07:38] RECOVERY - PyBal connections to etcd on lvs1013 is OK: OK: 12 connections established with conf1004.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [11:10:24] 10Operations, 
10Traffic, 10User-jbond: Check if traffic servers need restarting/reloading post CA change - https://phabricator.wikimedia.org/T240330 (10jbond) [11:10:32] (03PS1) 10Giuseppe Lavagetto: php-check-and-restart.sh: use full path of php7adm [puppet] - 10https://gerrit.wikimedia.org/r/556165 [11:10:37] 10Operations, 10Traffic, 10User-jbond: Check if traffic servers need restarting/reloading post CA change - https://phabricator.wikimedia.org/T240330 (10jbond) p:05Triage→03Normal [11:10:42] RECOVERY - PyBal connections to etcd on lvs1014 is OK: OK: 24 connections established with conf1004.eqiad.wmnet:4001 (min=24) https://wikitech.wikimedia.org/wiki/PyBal [11:11:08] 10Operations, 10Traffic, 10User-jbond: Check if traffic servers need restarting/reloading post CA change - https://phabricator.wikimedia.org/T240330 (10jbond) [11:13:11] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/556165 (owner: 10Giuseppe Lavagetto) [11:13:29] 10Operations, 10Analytics, 10SRE-Access-Requests: Add accraze to analytics-privatedata-users - https://phabricator.wikimedia.org/T240243 (10jcrespo) @Nuria See original request at T226204#5279623 where @Ottomata suggested this group but was not added. Is this something you approve, as an addendum to the orig... 
[11:13:51] 10Operations, 10Traffic, 10User-jbond: Check if traffic servers need restarting/reloading post CA change - https://phabricator.wikimedia.org/T240330 (10jbond) [11:14:55] (03PS2) 10Giuseppe Lavagetto: php-check-and-restart.sh: use full path of php7adm, remove quotes [puppet] - 10https://gerrit.wikimedia.org/r/556165 [11:16:39] 10Operations, 10DBA: backup2001 rebooted itself - https://phabricator.wikimedia.org/T240177 (10jcrespo) p:05Triage→03Normal a:03jcrespo [11:17:45] 10Operations, 10Analytics, 10SRE-Access-Requests: Add accraze to analytics-privatedata-users - https://phabricator.wikimedia.org/T240243 (10elukey) To add some context, this was originated by me not finding the user among analytics-privatedata-users when trying to add kerberos credentials. All the members of... [11:18:46] (03CR) 10Giuseppe Lavagetto: [C: 03+2] php-check-and-restart.sh: use full path of php7adm, remove quotes [puppet] - 10https://gerrit.wikimedia.org/r/556165 (owner: 10Giuseppe Lavagetto) [11:24:23] (03PS1) 10Jcrespo: admin: Add accraze to analytics-privadata-users [puppet] - 10https://gerrit.wikimedia.org/r/556168 (https://phabricator.wikimedia.org/T240243) [11:25:08] RECOVERY - Check systemd state on mw1296 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:25:25] (03CR) 10Jcrespo: [C: 04-2] "Not to be merged until approval: https://phabricator.wikimedia.org/T240243" [puppet] - 10https://gerrit.wikimedia.org/r/556168 (https://phabricator.wikimedia.org/T240243) (owner: 10Jcrespo) [11:26:28] RECOVERY - snapshot of s4 in codfw on db1115 is OK: snapshot for s4 at codfw taken less than 4 days ago and larger than 90 GB: Last one 2019-12-10 09:55:00 from db2099.codfw.wmnet:3314 (1114 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [11:26:32] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Add accraze to analytics-privatedata-users - https://phabricator.wikimedia.org/T240243 
(10jcrespo) ^I have prepared the patch to merge it as soon as everybody agrees. [11:27:13] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Add accraze to analytics-privatedata-users - https://phabricator.wikimedia.org/T240243 (10jcrespo) a:03Nuria Please reassign to me when ok or if there are comments. [11:27:45] !log mbsantos@deploy1001 Started deploy [kartotherian/deploy@452b144] (stretch): Update kartotherian-package to f9fb029 (T240227) [11:27:50] RECOVERY - Check the last execution of php7.2-fpm_check_restart on mw1296 is OK: OK: Status of the systemd unit php7.2-fpm_check_restart https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:52] T240227: [Bug] populate_admin: ERROR: Relate Operation called with a LWGEOMCOLLECTION type - https://phabricator.wikimedia.org/T240227 [11:28:06] !log mbsantos@deploy1001 Finished deploy [kartotherian/deploy@452b144] (stretch): Update kartotherian-package to f9fb029 (T240227) (duration: 00m 20s) [11:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:18] onimisionipe: ^ [11:28:23] Ready for test [11:29:16] RECOVERY - Check systemd state on mw1298 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:34:29] !log rolloing restart of ats servers [11:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:48] 10Operations, 10SRE-Access-Requests, 10Security-Team, 10Patch-For-Review: (2019-09) Create secteam groups in admin.yaml and define permissions - https://phabricator.wikimedia.org/T223463 (10jcrespo) Hey, @chasemp, is this in your radar (lot of time passed since last update)? If yes, but "there is need of s... 
[11:36:30] RECOVERY - Check the last execution of php7.2-fpm_check_restart on mw1298 is OK: OK: Status of the systemd unit php7.2-fpm_check_restart https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:37:48] 10Operations, 10Pybal, 10SRE-tools, 10Traffic, 10serviceops: Applications and scripts need to be able to understand the pooled status of servers in our load balancers. - https://phabricator.wikimedia.org/T239392 (10akosiaris) >>! In T239392#5701649, @akosiaris wrote: > `need to be able to understand the... [11:40:54] 10Operations, 10DNS, 10Research, 10Traffic: Add wikiworkshop.org to the Foundation's DNS - https://phabricator.wikimedia.org/T240303 (10jcrespo) a:03leila Assigning to @leila as per BBlack and Reedy comments, as there seems to be some additional information required. Please feel free to reassign to the r... [11:43:23] 10Operations, 10Prod-Kubernetes, 10Pybal, 10Traffic, 10serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10mark) I agree - it seems that PyBal adds no real value here, because it's essentially load balancing the k8s load balanc... [11:46:09] 10Operations, 10DBA: backup2001 rebooted itself - https://phabricator.wikimedia.org/T240177 (10jcrespo) a:05jcrespo→03Papaul Not the first time this happens: T237730 And firmware was updated at that time. @Papaul Could you file a support issue to vendor, given it is the second time this happened? What inf... 
[11:46:55] <_joe_> !log restarting etcd on conf1005, also etcdmirrormaker [11:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:48] (03CR) 10Ladsgroup: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/553097 (https://phabricator.wikimedia.org/T238751) (owner: 10Alaa Sarhan) [11:55:23] (03CR) 10BBlack: [C: 03+2] dnsrecursor: rename ulimits to override [puppet] - 10https://gerrit.wikimedia.org/r/556084 (https://phabricator.wikimedia.org/T240285) (owner: 10BBlack) [11:55:26] (03CR) 10BBlack: [C: 03+2] dnsrecursor: add parameter bind_service [puppet] - 10https://gerrit.wikimedia.org/r/556085 (https://phabricator.wikimedia.org/T240285) (owner: 10BBlack) [11:55:30] (03CR) 10BBlack: [C: 03+2] dnsbox: replace glue with bind_service [puppet] - 10https://gerrit.wikimedia.org/r/556086 (https://phabricator.wikimedia.org/T240285) (owner: 10BBlack) [11:55:34] (03CR) 10BBlack: [C: 03+2] dnsbox: eliminate extra profile layer [puppet] - 10https://gerrit.wikimedia.org/r/556087 (https://phabricator.wikimedia.org/T240285) (owner: 10BBlack) [11:58:19] (03PS1) 10Filippo Giunchedi: logstash: add explicit IDs to plugins [puppet] - 10https://gerrit.wikimedia.org/r/556173 (https://phabricator.wikimedia.org/T215904) [11:59:41] 10Operations, 10Prod-Kubernetes, 10Pybal, 10Traffic, 10serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10akosiaris) >>! In T238909#5727644, @mark wrote: > I agree - it seems that PyBal adds no real value here, because it's es... [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: (Dis)respected human, time to deploy European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191210T1200). Please do the needful. [12:00:04] Nikerabbit: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:39] ~o/ [12:01:58] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=bacula site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:02:31] o/ [12:02:42] oh, but I can’t SWAT, I have to run, sorry [12:02:46] hope you find someone else :/ [12:03:29] let's see [12:04:08] Nikerabbit: I can SWAT! [12:04:15] (or you can push your code to prod yourslef) [12:04:43] Urbanecm: I haven't done that in years, but I was already eyeing the docs [12:05:07] Nikerabbit: the decision is yours, I'm fine with deploying it for you :) [12:05:55] (03PS3) 10Urbanecm: Add 'wiki-for-human-rights' CX campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555921 (https://phabricator.wikimedia.org/T239977) (owner: 10KartikMistry) [12:06:23] Urbanecm: I can try, but please be around when I break something [12:06:34] Nikerabbit: sure! [12:08:05] Urbanecm: so I start with +2, right? [12:08:09] Nikerabbit: yes [12:08:17] 10Operations, 10Prod-Kubernetes, 10Pybal, 10Traffic, 10serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10akosiaris) >>! In T238909#5727698, @mark wrote: >>>! In T238909#5727693, @akosiaris wrote: > >> True. We could investig... 
[12:08:36] (03CR) 10Nikerabbit: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555921 (https://phabricator.wikimedia.org/T239977) (owner: 10KartikMistry) [12:08:42] (03CR) 10Muehlenhoff: Setup apt pinning for puppet 5 / facter 3 on stretch/jessie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/554549 (https://phabricator.wikimedia.org/T239832) (owner: 10Muehlenhoff) [12:09:31] (03Merged) 10jenkins-bot: Add 'wiki-for-human-rights' CX campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555921 (https://phabricator.wikimedia.org/T239977) (owner: 10KartikMistry) [12:10:52] Urbanecm: now fetching it [12:10:57] Nikerabbit: ack [12:11:48] Urbanecm: and now I'm supposed to pull it on mwdebug1002? [12:11:59] Nikerabbit: mwdebug1001 [12:12:05] mwdebug1002 is flaky few last days [12:12:13] Urbanecm: gotcha [12:12:44] Urbanecm: just `scap pull`? will it take long? [12:12:54] Nikerabbit: yes, scap pull. It should be fast [12:13:28] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (10jbond) [12:13:50] fast indeed, now doing manual test [12:13:55] Nikerabbit: ack [12:13:56] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:15:21] Urbanecm: my manual test was success. do you see anything that would prevent continuing with deployment [12:15:33] looking [12:15:54] I'd say we're fine! 
[12:16:09] ok, will sync [12:16:27] k [12:16:33] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (10jbond) [12:19:13] !log nikerabbit@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:555921|Add wiki-for-human-rights CX campaign (T239977)]] (duration: 00m 56s) [12:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:19] T239977: Create URL campaign for Wiki for Human Rights in Content Translation - https://phabricator.wikimedia.org/T239977 [12:21:20] Nikerabbit: seems it's done? [12:22:18] Urbanecm: yeah, though in manual testing with x-debug I'm still trying to get it work [12:23:04] Nikerabbit: great [12:23:19] (03CR) 10Urbanecm: [C: 03+2] Enable abusefilter blocking cap at testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554482 (owner: 10Urbanecm) [12:23:41] Urbanecm: I don't understand why it doesn't work without. assuming I synced the right file [12:24:05] (03Merged) 10jenkins-bot: Enable abusefilter blocking cap at testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554482 (owner: 10Urbanecm) [12:25:25] Nikerabbit: how can one see if it works? [12:26:13] Urbanecm: for example, go to https://en.wikipedia.org/wiki/Special:CXStats and see that wiki-for-human-rights is included in mw.config.values.wgContentTranslationCampaigns [12:27:10] something like this? https://usercontent.irccloud-cdn.com/file/Ps1J9Vjc/image.png [12:27:18] 10Operations, 10Wikimedia-Logstash: Upgrade ELK Stack - https://phabricator.wikimedia.org/T234854 (10herron) @elukey hey, yes that's been fixed by making a newer version of curator available to the new clusters. Haven't seen cron errors from these since Dec 5. Thanks for cleaning up the "config does not exis... [12:27:57] Urbanecm: yeah, so you see it? [12:28:08] Nikerabbit: yes. I'd blame cache, since it's javascript. [12:28:38] Urbanecm: hmm... 
I just know what cache there would be on a dynamic page [12:30:09] Nikerabbit: Does it work for you if you load the page with ?debug=1? [12:30:57] (03PS2) 10BBlack: lvs recdns: decom lvs-specific parts [puppet] - 10https://gerrit.wikimedia.org/r/555537 (https://phabricator.wikimedia.org/T239993) [12:30:59] (03PS2) 10BBlack: lvs recdns: clean up realserver def [puppet] - 10https://gerrit.wikimedia.org/r/555538 (https://phabricator.wikimedia.org/T239993) [12:31:01] (03PS1) 10BBlack: lvs recdns: eqiad and codfw keep old addr, for now [puppet] - 10https://gerrit.wikimedia.org/r/556177 (https://phabricator.wikimedia.org/T239993) [12:31:03] (03PS1) 10BBlack: lvs recdns: remove legacy IP definition, step 1 [puppet] - 10https://gerrit.wikimedia.org/r/556178 (https://phabricator.wikimedia.org/T239993) [12:31:05] (03PS1) 10BBlack: lvs recdns: remove legacy IP definition, step 2 [puppet] - 10https://gerrit.wikimedia.org/r/556179 (https://phabricator.wikimedia.org/T239993) [12:31:07] Urbanecm: are you sure you don't have x-debug enabled? [12:31:56] (03CR) 10jerkins-bot: [V: 04-1] lvs recdns: clean up realserver def [puppet] - 10https://gerrit.wikimedia.org/r/555538 (https://phabricator.wikimedia.org/T239993) (owner: 10BBlack) [12:32:03] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 80fac66: Enable abusefilter blocking cap at testwiki (duration: 00m 55s) [12:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:08] no work for me with debug=1 nor with incognito mode [12:32:40] well now I see it... perhaps you second sync worked [12:33:13] Nikerabbit: I didn't have x-debug enabled. 
Well, glad it works then :-) [12:33:24] (03CR) 10jerkins-bot: [V: 04-1] lvs recdns: eqiad and codfw keep old addr, for now [puppet] - 10https://gerrit.wikimedia.org/r/556177 (https://phabricator.wikimedia.org/T239993) (owner: 10BBlack) [12:33:32] (but I tested en.wiki, maybe it mattered for some reaon) [12:33:34] *cs [12:33:46] (03CR) 10jerkins-bot: [V: 04-1] lvs recdns: remove legacy IP definition, step 1 [puppet] - 10https://gerrit.wikimedia.org/r/556178 (https://phabricator.wikimedia.org/T239993) (owner: 10BBlack) [12:34:13] I don't know, I tested en/fi/de wikis with same issue [12:35:55] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (10jbond) [12:35:58] (03PS2) 10BBlack: lvs recdns: eqiad and codfw keep old addr, for now [puppet] - 10https://gerrit.wikimedia.org/r/556177 (https://phabricator.wikimedia.org/T239993) [12:36:00] (03PS3) 10BBlack: lvs recdns: clean up realserver def [puppet] - 10https://gerrit.wikimedia.org/r/555538 (https://phabricator.wikimedia.org/T239993) [12:36:02] (03PS2) 10BBlack: lvs recdns: remove legacy IP definition, step 1 [puppet] - 10https://gerrit.wikimedia.org/r/556178 (https://phabricator.wikimedia.org/T239993) [12:36:04] (03PS2) 10BBlack: lvs recdns: remove legacy IP definition, step 2 [puppet] - 10https://gerrit.wikimedia.org/r/556179 (https://phabricator.wikimedia.org/T239993) [12:36:29] !log urbanecm@deploy1001 Synchronized wmf-config/abusefilter.php: SWAT: 80fac66: Enable abusefilter blocking cap at testwiki (duration: 00m 55s) [12:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:47] !log EU SWAT done [12:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:05] (03CR) 10jerkins-bot: [V: 04-1] lvs recdns: clean up realserver def [puppet] - 10https://gerrit.wikimedia.org/r/555538 (https://phabricator.wikimedia.org/T239993) (owner: 10BBlack) 
[12:37:22] Urbanecm: thanks! [12:37:28] happy to help! [12:38:02] (03CR) 10jerkins-bot: [V: 04-1] lvs recdns: eqiad and codfw keep old addr, for now [puppet] - 10https://gerrit.wikimedia.org/r/556177 (https://phabricator.wikimedia.org/T239993) (owner: 10BBlack) [12:38:41] (03CR) 10jerkins-bot: [V: 04-1] lvs recdns: remove legacy IP definition, step 1 [puppet] - 10https://gerrit.wikimedia.org/r/556178 (https://phabricator.wikimedia.org/T239993) (owner: 10BBlack) [12:39:14] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:39:35] (03PS1) 10Ladsgroup: Offboard Alaa [puppet] - 10https://gerrit.wikimedia.org/r/556180 [12:40:36] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:41:20] (03CR) 10jerkins-bot: [V: 04-1] Offboard Alaa [puppet] - 10https://gerrit.wikimedia.org/r/556180 (owner: 10Ladsgroup) [12:44:16] (03PS3) 10BBlack: lvs recdns: eqiad and codfw keep old addr, for now [puppet] - 10https://gerrit.wikimedia.org/r/556177 (https://phabricator.wikimedia.org/T239993) [12:44:18] (03PS4) 10BBlack: lvs recdns: clean up realserver def [puppet] - 10https://gerrit.wikimedia.org/r/555538 (https://phabricator.wikimedia.org/T239993) [12:44:20] (03PS3) 10BBlack: lvs recdns: remove legacy IP definition, step 1 [puppet] - 10https://gerrit.wikimedia.org/r/556178 (https://phabricator.wikimedia.org/T239993) [12:44:22] (03PS3) 10BBlack: lvs recdns: remove legacy IP definition, step 2 [puppet] - 10https://gerrit.wikimedia.org/r/556179 (https://phabricator.wikimedia.org/T239993) [12:44:24] (03PS2) 10Ladsgroup: Offboard Alaa [puppet] - 10https://gerrit.wikimedia.org/r/556180 [12:50:32] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Document all uses of the 
puppetCA certificate - https://phabricator.wikimedia.org/T237259 (10jbond) [12:54:55] (03PS5) 10Phamhi: wmcs: use hiera for labmon/cloudmetrics instead of harcoded values [puppet] - 10https://gerrit.wikimedia.org/r/554853 (https://phabricator.wikimedia.org/T224585) (owner: 10Arturo Borrero Gonzalez) [12:56:34] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (10jbond) [12:56:53] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (10jbond) [12:57:25] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (10jbond) [12:58:19] (03CR) 10Phamhi: wmcs: use hiera for labmon/cloudmetrics instead of harcoded values (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/554853 (https://phabricator.wikimedia.org/T224585) (owner: 10Arturo Borrero Gonzalez) [13:10:13] (03PS1) 10BBlack: dnsrecursor: remove redundant listen for prod case [puppet] - 10https://gerrit.wikimedia.org/r/556183 (https://phabricator.wikimedia.org/T240285) [13:14:07] (03CR) 10BBlack: [C: 03+2] dnsrecursor: remove redundant listen for prod case [puppet] - 10https://gerrit.wikimedia.org/r/556183 (https://phabricator.wikimedia.org/T240285) (owner: 10BBlack) [13:15:37] I'd like to reserve a 1-hour deployment window beginning... ASAP. This is to fix a wmf.8 and potentially wmf.10 train blocker. [13:15:42] (03CR) 10Filippo Giunchedi: [C: 03+2] "Haven't checked all dashboard IDs but PCC looks good https://puppet-compiler.wmflabs.org/compiler1003/19866/icinga1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/556108 (https://phabricator.wikimedia.org/T198287) (owner: 10Phedenskog) [13:18:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM. Please, get filippo +1 as well." 
[puppet] - 10https://gerrit.wikimedia.org/r/554853 (https://phabricator.wikimedia.org/T224585) (owner: 10Arturo Borrero Gonzalez) [13:21:54] !log Compress table db2115 wikishared.cx_corpora on db2115 - T240325 [13:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:00] T240325: Compress wikisahred.cx_corpora on x1 hosts - https://phabricator.wikimedia.org/T240325 [13:24:22] (03PS1) 10Arturo Borrero Gonzalez: toolforge: new k8s: set up 3 nginx-ingress pod replicas [puppet] - 10https://gerrit.wikimedia.org/r/556184 (https://phabricator.wikimedia.org/T239405) [13:26:12] 10Operations, 10DNS, 10Traffic: redirect wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (10Bugreporter) [13:26:53] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/554853 (https://phabricator.wikimedia.org/T224585) (owner: 10Arturo Borrero Gonzalez) [13:27:58] (03PS3) 10BBlack: lvs recdns: decom lvs-specific parts [puppet] - 10https://gerrit.wikimedia.org/r/555537 (https://phabricator.wikimedia.org/T239993) [13:28:00] (03PS4) 10BBlack: lvs recdns: eqiad and codfw keep old addr, for now [puppet] - 10https://gerrit.wikimedia.org/r/556177 (https://phabricator.wikimedia.org/T239993) [13:28:02] (03PS5) 10BBlack: lvs recdns: clean up realserver def [puppet] - 10https://gerrit.wikimedia.org/r/555538 (https://phabricator.wikimedia.org/T239993) [13:28:10] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (10jbond) [13:28:18] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (10jbond) * `puppetdb` and postgresql restarted on puppetdb[12]002 * `postgresql` restarted on netbox[12]001 * `stunnel4` service restarted on all service... 
[13:29:26] (03PS1) 10Ema: ATS: improve session/token match [puppet] - 10https://gerrit.wikimedia.org/r/556185 (https://phabricator.wikimedia.org/T227432) [13:31:07] 10Operations, 10DNS, 10Traffic: redirect wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (10Reedy) Or just stop using it completely? ;) [13:31:25] 04Critical Alert for device asw2-esams.mgmt.esams.wmnet - Primary outbound port utilisation over 80% [13:32:52] (03CR) 10Phamhi: [C: 03+2] wmcs: use hiera for labmon/cloudmetrics instead of harcoded values [puppet] - 10https://gerrit.wikimedia.org/r/554853 (https://phabricator.wikimedia.org/T224585) (owner: 10Arturo Borrero Gonzalez) [13:34:21] (03CR) 10BBlack: [C: 03+1] ATS: improve session/token match [puppet] - 10https://gerrit.wikimedia.org/r/556185 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [13:34:57] 10Operations, 10DNS, 10Traffic: redirect wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (10Ammarpad) May be the question to ask is do we //really// need that? 
[13:35:39] 10Operations, 10Traffic, 10User-jbond: Check if traffic servers need restarting/reloading post CA change - https://phabricator.wikimedia.org/T240330 (10jbond) ema, confirmed that the traffic servers did need a restart which has now been preformed [13:35:45] 10Operations, 10Traffic, 10User-jbond: Check if traffic servers need restarting/reloading post CA change - https://phabricator.wikimedia.org/T240330 (10jbond) 05Open→03Resolved [13:35:48] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (10jbond) [13:41:26] 04̶C̶r̶i̶t̶i̶c̶a̶l Device asw2-esams.mgmt.esams.wmnet recovered from Primary outbound port utilisation over 80% [13:41:45] 10Operations, 10DNS, 10Traffic: redirect wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (10Bugreporter) I don't know any way to get the number of tried resolutions (or views) of a non-existent domain. 
[13:43:36] (03CR) 10Phamhi: wmcs: make cloudmetrics1002 the primary instead of labmon1001 (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/554844 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [13:49:04] (03PS6) 10Muehlenhoff: Setup apt pinning for puppet 5 / facter 3 on stretch/jessie [puppet] - 10https://gerrit.wikimedia.org/r/554549 (https://phabricator.wikimedia.org/T239832) [13:51:06] (03CR) 10Ema: [C: 03+2] ATS: improve session/token match [puppet] - 10https://gerrit.wikimedia.org/r/556185 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [13:52:33] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: new k8s: set up 3 nginx-ingress pod replicas [puppet] - 10https://gerrit.wikimedia.org/r/556184 (https://phabricator.wikimedia.org/T239405) (owner: 10Arturo Borrero Gonzalez) [13:54:05] RECOVERY - Confd template for /srv/config-master/pybal/codfw/kibana on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd [13:54:23] !log remove stale puppetmaster2001:/var/run/confd-template/.*.err [13:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:31] (03PS1) 10Elukey: profile::hadoop::worker: add set_yarn_dir_ownership script [puppet] - 10https://gerrit.wikimedia.org/r/556190 (https://phabricator.wikimedia.org/T237269) [13:58:17] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/554549 (https://phabricator.wikimedia.org/T239832) (owner: 10Muehlenhoff) [14:04:55] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/kibana on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd [14:05:36] 10Operations, 10observability, 10Availability, 10Patch-For-Review: Monitor MediaWiki sessions - https://phabricator.wikimedia.org/T108985 (10fgiunchedi) 05Open→03Resolved This was done in https://gerrit.wikimedia.org/r/c/operations/puppet/+/350555 [14:07:41] 10Operations, 10ops-eqiad, 10Analytics: Degraded RAID on 
dbstore1003 - https://phabricator.wikimedia.org/T239217 (10Marostegui) Disk replaced by John and I can see it rebuilding: ` root@dbstore1003:~# megacli -PDRbld -ShowProg -physdrv[32:4] -aALL Rebuild Progress on Device at Enclosure 32, Slot 4 Completed... [14:08:12] 10Operations, 10observability: non sms alternatives - https://phabricator.wikimedia.org/T114651 (10fgiunchedi) 05Open→03Invalid We'll indeed be investigating non-SMS alternatives as a requirement for [[ https://docs.google.com/document/d/1zue1Fxaaf_vaiyE8E2pJgOIyw0jKPVsfROUDYAwZwsM/edit# | pages escalation... [14:08:43] !log rolling restart of varnishkafaka-webrequest and varnishkafaka-eventloggin [14:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:35] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=bacula site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:10:55] (03PS2) 10Alexandros Kosiaris: Add kubetcd[12]00[456], kubestagetcd100[456] [dns] - 10https://gerrit.wikimedia.org/r/554561 (https://phabricator.wikimedia.org/T239838) [14:11:54] 10Operations, 10observability, 10Tracking-Neverending: Improve access to and control over incident and metrics monitoring infrastructure - https://phabricator.wikimedia.org/T124179 (10fgiunchedi) 05Open→03Declined Declining as these points are covered by the [[ https://docs.google.com/document/d/1cpoPUBu... [14:11:58] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (10jbond) [14:12:07] 10Operations, 10observability: Evaluate alternative web interfaces to icinga 1 core - https://phabricator.wikimedia.org/T124185 (10fgiunchedi) 05Open→03Declined Declining as these points are covered by the [[ https://docs.google.com/document/d/1cpoPUBuKoo9y-oygZcwQ4pybHM6L7yo2F2TPrU2EIbQ/edit#heading=h.vyt... 
[14:12:09] 10Operations, 10observability, 10Tracking-Neverending: Improve access to and control over incident and metrics monitoring infrastructure - https://phabricator.wikimedia.org/T124179 (10fgiunchedi) [14:12:15] (03PS3) 10Muehlenhoff: Offboard Alaa [puppet] - 10https://gerrit.wikimedia.org/r/556180 (owner: 10Ladsgroup) [14:14:22] 10Operations, 10Release-Engineering-Team-TODO, 10observability, 10Release-Engineering-Team (Deployment services): "MediaWiki exceptions and fatals per minute" alarm is too slow (half an hour delay!) - https://phabricator.wikimedia.org/T141520 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi I believe... [14:14:28] (03PS4) 10Muehlenhoff: Offboard Alaa [puppet] - 10https://gerrit.wikimedia.org/r/556180 (owner: 10Ladsgroup) [14:15:02] 10Operations, 10ops-eqiad, 10Analytics: Degraded RAID on dbstore1003 - https://phabricator.wikimedia.org/T239217 (10Jclark-ctr) 05Open→03Resolved Replaced Failed Drive [14:15:35] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:18:07] (03PS1) 10Jcrespo: Revert "Revert "Revert "bacula: Schedule hourly copies of production backups to the offsite pool""" [puppet] - 10https://gerrit.wikimedia.org/r/556192 [14:18:09] (03CR) 10Muehlenhoff: [C: 03+2] Offboard Alaa [puppet] - 10https://gerrit.wikimedia.org/r/556180 (owner: 10Ladsgroup) [14:18:51] (03PS3) 10Alexandros Kosiaris: Add kubetcd[12]00[456], kubestagetcd100[456] [dns] - 10https://gerrit.wikimedia.org/r/554561 (https://phabricator.wikimedia.org/T239838) [14:20:18] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add kubetcd[12]00[456], kubestagetcd100[456] [dns] - 10https://gerrit.wikimedia.org/r/554561 (https://phabricator.wikimedia.org/T239838) (owner: 10Alexandros Kosiaris) [14:20:24] 10Operations, 10ops-eqiad, 10Analytics: analytics1057's BBU is faulty - https://phabricator.wikimedia.org/T239045 (10elukey) 05Open→03Resolved I have set puppet to check for WriteThrough, not WriteBack, so alarms will go away. This host will be refreshed during the next months. 
[14:21:00] (03CR) 10Ottomata: [C: 03+1] profile::hadoop::worker: add set_yarn_dir_ownership script [puppet] - 10https://gerrit.wikimedia.org/r/556190 (https://phabricator.wikimedia.org/T237269) (owner: 10Elukey) [14:22:33] (03PS7) 10Muehlenhoff: Setup apt pinning for puppet 5 / facter 3 on stretch/jessie [puppet] - 10https://gerrit.wikimedia.org/r/554549 (https://phabricator.wikimedia.org/T239832) [14:24:27] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:26:02] (03CR) 10Jhedden: aptrepo: add ceph nautilus repo for cloudvps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/556000 (https://phabricator.wikimedia.org/T239917) (owner: 10Jhedden) [14:26:11] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=bacula site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:26:33] PROBLEM - bacula director process on backup1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (bacula), command name bacula-dir https://wikitech.wikimedia.org/wiki/Bacula [14:28:21] RECOVERY - bacula director process on backup1001 is OK: PROCS OK: 1 process with UID = 112 (bacula), command name bacula-dir https://wikitech.wikimedia.org/wiki/Bacula [14:29:43] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:29:45] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:30:40] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.makevm [14:30:40] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [14:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:38] (03CR) 10Jhedden: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/556000 (https://phabricator.wikimedia.org/T239917) (owner: 10Jhedden) [14:32:27] (03PS1) 10Jcrespo: bacula: Setup weekly copy migrations until a first run happens [puppet] - 10https://gerrit.wikimedia.org/r/556195 (https://phabricator.wikimedia.org/T238048) [14:32:41] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.makevm [14:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:41] (03CR) 10Jcrespo: [C: 03+2] bacula: Setup weekly copy migrations until a first run happens [puppet] - 10https://gerrit.wikimedia.org/r/556195 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [14:33:50] (03PS2) 10Jcrespo: bacula: Setup weekly copy migrations until a first run happens [puppet] - 10https://gerrit.wikimedia.org/r/556195 (https://phabricator.wikimedia.org/T238048) [14:33:54] (03CR) 10Herron: "Nice idea! Looks good to me overall. IMO we should stick to a convention for the ids, e.g. type/plugin/action/description or similar. I s" [puppet] - 10https://gerrit.wikimedia.org/r/556173 (https://phabricator.wikimedia.org/T215904) (owner: 10Filippo Giunchedi) [14:34:06] 10Operations, 10observability: Allow customizing the alert message from graphite - https://phabricator.wikimedia.org/T95801 (10fgiunchedi) 05Open→03Declined We're getting rid of graphite and graphite-based alerts over time, declining but please reopen if needed! 
[14:39:01] Operations, Puppet, Patch-For-Review, User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (jbond) I have now updated the relevant CA files in the private repo [14:39:06] Operations, observability: Monitoring: add link to graph for Icinga timeseries alarms - https://phabricator.wikimedia.org/T167422 (fgiunchedi) We have `dashboard_links` and `notes_link` now for `check_prometheus` and `monitoring::service` enforces the presence of `notes_url`. @volans is there anything el... [14:40:28] Operations, observability: Export ipsec counters as Prometheus metrics - https://phabricator.wikimedia.org/T154619 (fgiunchedi) Open→Declined Given we're moving off ipsec for most/all use cases I'm boldly declining [14:41:29] PROBLEM - Host stat1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:41:52] (CR) Muehlenhoff: [C: +2] Setup apt pinning for puppet 5 / facter 3 on stretch/jessie [puppet] - https://gerrit.wikimedia.org/r/554549 (https://phabricator.wikimedia.org/T239832) (owner: Muehlenhoff) [14:42:24] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [14:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:32] Operations, DNS, Traffic: redirect non-existing wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (Aklapper) [14:43:03] Operations, DNS, Traffic: redirect non-existing wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (Aklapper) @Bugreporter: What does a namespace on a wiki have to do with a subdomain?
[14:43:13] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.makevm [14:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:03] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp2007 is OK: HTTP OK: HTTP/1.1 200 Ok - 34367 bytes in 0.114 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:44:11] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/556173 (https://phabricator.wikimedia.org/T215904) (owner: 10Filippo Giunchedi) [14:45:45] 10Operations, 10observability: Monitoring: add link to graph for Icinga timeseries alarms - https://phabricator.wikimedia.org/T167422 (10Volans) That's great. The idea of the task was to link the specific dashboard that has the same data, while sometimes we use data that is not showed on grafana at all or we l... [14:47:21] RECOVERY - Host stat1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.92 ms [14:49:12] jouncebot: now [14:49:12] No deployments scheduled for the next 2 hour(s) and 10 minute(s) [14:49:13] jouncebot: next [14:49:14] In 2 hour(s) and 10 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191210T1700) [14:49:33] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:49:45] (03PS1) 10Ema: ATS: test setup for default.lua [puppet] - 10https://gerrit.wikimedia.org/r/556197 (https://phabricator.wikimedia.org/T227432) [14:51:53] (03PS5) 10Reedy: wikitech: remove OSM settings related to OpenStack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549936 (https://phabricator.wikimedia.org/T161553) (owner: 10Andrew Bogott) [14:51:59] (03CR) 10Reedy: [C: 03+2] wikitech: remove OSM settings related to OpenStack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549936 
(https://phabricator.wikimedia.org/T161553) (owner: 10Andrew Bogott) [14:52:48] (03Merged) 10jenkins-bot: wikitech: remove OSM settings related to OpenStack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549936 (https://phabricator.wikimedia.org/T161553) (owner: 10Andrew Bogott) [14:52:50] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [14:52:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:25] 10Operations, 10DNS, 10Traffic: redirect non-existing wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (10jcrespo) I am just here doing clinic duty for the #operations tag. #traffic should decide on this ticket, but based on my (limited) understanding of our... [14:53:30] (03CR) 10Ema: [C: 03+2] ATS: test setup for default.lua [puppet] - 10https://gerrit.wikimedia.org/r/556197 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [14:53:57] (03PS2) 10Filippo Giunchedi: logstash: add explicit IDs to plugins [puppet] - 10https://gerrit.wikimedia.org/r/556173 (https://phabricator.wikimedia.org/T215904) [14:54:58] 10Operations, 10Analytics-Kanban, 10Better Use Of Data, 10Event-Platform, and 8 others: Set up eventgate-logging-external in production - https://phabricator.wikimedia.org/T236386 (10jlinehan) >>! In T236386#5725166, @Ottomata wrote: > @jlinehan thoughts? I'm considering moving forward with intake-{analyt... [14:55:46] !log reedy@deploy1001 Synchronized wmf-config/wikitech.php: T161553 Bye OSM config! 
(duration: 00m 55s) [14:55:50] (03CR) 10Elukey: [C: 03+2] profile::hadoop::worker: add set_yarn_dir_ownership script [puppet] - 10https://gerrit.wikimedia.org/r/556190 (https://phabricator.wikimedia.org/T237269) (owner: 10Elukey) [14:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:58] T161553: Remove OpenStackManager from Wikitech - https://phabricator.wikimedia.org/T161553 [14:55:59] (03PS2) 10Reedy: Use extension.json in extension-list for LdapAuthentication and OSM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550037 [14:56:31] (03PS3) 10Reedy: Use extension.json in extension-list for LdapAuthentication and OSM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550037 (https://phabricator.wikimedia.org/T139800) [14:57:10] (03CR) 10Reedy: [C: 03+2] Use extension.json in extension-list for LdapAuthentication and OSM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550037 (https://phabricator.wikimedia.org/T139800) (owner: 10Reedy) [14:57:57] (03Merged) 10jenkins-bot: Use extension.json in extension-list for LdapAuthentication and OSM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550037 (https://phabricator.wikimedia.org/T139800) (owner: 10Reedy) [14:59:47] (03CR) 10Andrew Bogott: [C: 03+1] "\o/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549936 (https://phabricator.wikimedia.org/T161553) (owner: 10Andrew Bogott) [14:59:52] !log reedy@deploy1001 Synchronized wmf-config/extension-list: Load OSM and LdapAuth via extension.json for messages (duration: 00m 55s) [14:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:58] (03PS2) 10Reedy: Use wfLoadExtension() for LdapAuthentication and OSM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550038 (https://phabricator.wikimedia.org/T140852) [15:01:31] !log installing systemd updates from stretch 9.11 point release [15:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[15:01:47] (03CR) 10Reedy: [C: 03+2] Use wfLoadExtension() for LdapAuthentication and OSM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550038 (https://phabricator.wikimedia.org/T140852) (owner: 10Reedy) [15:02:34] (03Merged) 10jenkins-bot: Use wfLoadExtension() for LdapAuthentication and OSM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550038 (https://phabricator.wikimedia.org/T140852) (owner: 10Reedy) [15:03:54] !log reedy@deploy1001 Synchronized wmf-config/wikitech.php: Load OSM and LdapAuth via extension.json T140852 (duration: 00m 55s) [15:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:00] T140852: Load all Wikimedia-deployed extensions and skins via extension registration - https://phabricator.wikimedia.org/T140852 [15:04:22] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.makevm [15:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:54] (03PS1) 10Elukey: role::analytics_test_cluster::coordinator: use IPv6 in Hive [puppet] - 10https://gerrit.wikimedia.org/r/556198 (https://phabricator.wikimedia.org/T240255) [15:09:17] (03PS1) 10Filippo Giunchedi: monitoring: fail on single-quote Prometheus query [puppet] - 10https://gerrit.wikimedia.org/r/556200 (https://phabricator.wikimedia.org/T188917) [15:09:27] (03PS1) 10Ema: ATS: use set_server_resp_no_store, do not hide CC [puppet] - 10https://gerrit.wikimedia.org/r/556201 (https://phabricator.wikimedia.org/T227432) [15:10:12] (03CR) 10Elukey: [C: 03+2] role::analytics_test_cluster::coordinator: use IPv6 in Hive [puppet] - 10https://gerrit.wikimedia.org/r/556198 (https://phabricator.wikimedia.org/T240255) (owner: 10Elukey) [15:11:23] 10Operations, 10Traffic: Start warning and deprecation process for all legacy TLS - https://phabricator.wikimedia.org/T238038 (10Ahecht) Currently, the user experience for someone seeing an error message such as the one at https://en.wikipedia.org/sec-warning is quite poor. 
To find out what they actually have... [15:12:23] (03CR) 10Ottomata: role::analytics_test_cluster::coordinator: use IPv6 in Hive (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/556198 (https://phabricator.wikimedia.org/T240255) (owner: 10Elukey) [15:13:57] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [15:14:00] 10Operations, 10Analytics-Kanban, 10Better Use Of Data, 10Event-Platform, and 8 others: Set up eventgate-logging-external in production - https://phabricator.wikimedia.org/T236386 (10Ottomata) > Changing the URL is easy is not really that easy :/ Possible though. Ooook. [15:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:28] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.makevm [15:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:02] (03PS1) 10Alexandros Kosiaris: k8s: Introduce kubetcd[12]00[456], kubestagetcd100[456] [puppet] - 10https://gerrit.wikimedia.org/r/556202 (https://phabricator.wikimedia.org/T239838) [15:15:05] PROBLEM - DPKG on analytics1064 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:15:54] (03CR) 10jerkins-bot: [V: 04-1] k8s: Introduce kubetcd[12]00[456], kubestagetcd100[456] [puppet] - 10https://gerrit.wikimedia.org/r/556202 (https://phabricator.wikimedia.org/T239838) (owner: 10Alexandros Kosiaris) [15:16:13] PROBLEM - DPKG on analytics1063 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:16:14] PROBLEM - DPKG on analytics1067 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:16:16] ^analytics1064 is the systemd update [15:16:33] (03CR) 10Elukey: [C: 03+2] role::analytics_test_cluster::coordinator: use IPv6 in Hive (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/556198 (https://phabricator.wikimedia.org/T240255) 
(owner: 10Elukey) [15:16:51] RECOVERY - DPKG on analytics1064 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:17:59] RECOVERY - DPKG on analytics1067 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:17:59] RECOVERY - DPKG on analytics1063 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:24:03] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [15:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:23] 10Operations, 10Goal: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (10jcrespo) Copy jobs are running now- we will see how much it takes to do a full copy. I setup for now copies to happen only every week because if I setup to do it every hour, bacul... [15:24:53] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.makevm [15:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:40] (03PS1) 10Jbond: profile::backup::director: increase number of open files. 
[puppet] - 10https://gerrit.wikimedia.org/r/556207 [15:32:03] (03PS3) 10Jhedden: aptrepo: add ceph nautilus repo for cloudvps [puppet] - 10https://gerrit.wikimedia.org/r/556000 (https://phabricator.wikimedia.org/T239917) [15:33:37] (03PS4) 10Jhedden: aptrepo: add ceph nautilus repo for cloudvps [puppet] - 10https://gerrit.wikimedia.org/r/556000 (https://phabricator.wikimedia.org/T239917) [15:34:06] (03CR) 10Jhedden: aptrepo: add ceph nautilus repo for cloudvps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/556000 (https://phabricator.wikimedia.org/T239917) (owner: 10Jhedden) [15:34:13] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [15:34:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:40] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.makevm [15:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:33] (03PS2) 10Jbond: profile::backup::director: increase number of open files. [puppet] - 10https://gerrit.wikimedia.org/r/556207 [15:41:45] Reedy: brennen: Hi, we have a test and a fix for the wmf.8 logspam. Does it make sense for me to deploy it now? [15:42:00] This same fix is also merged to Cite#master [15:42:10] Is it in .10 already too? [15:42:22] Or has .10 not been branched yet? [15:42:24] * Reedy looks [15:42:37] ah no it hasn't [15:42:59] awight: .8 is gonna be on wikis for another couple of days... Seems worthwhile cleaning up the log spam [15:43:11] Or at least, knowing if you need to fix up more stuff too [15:44:12] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [15:44:13] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.makevm [15:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:20] Reedy: cool. Shall I add a window to wikitech:Deployments, or just go for it? 
[15:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:27] awight: I'd just go for it :) [15:45:08] https://github.com/wikimedia/mediawiki-extensions-Cite/tree/wmf/1.35.0-wmf.8 [15:45:12] This branch is 3 commits ahead, 133 commits behind master. [15:45:12] heh [15:45:22] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.makevm [15:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:26] (PS1) Jbond: backup::director: add type checking and use lookup vs hiera [puppet] - https://gerrit.wikimedia.org/r/556211 [15:48:14] (PS1) TechneSiyam: Modified files with correct sized logos [mediawiki-config] - https://gerrit.wikimedia.org/r/556212 [15:53:47] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [15:53:52] (PS1) Muehlenhoff: Complete package list for slice pinning on stat/notebook [puppet] - https://gerrit.wikimedia.org/r/556213 [15:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:44] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.makevm [15:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:54] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [15:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:12] Operations, DBA, Growth-Team, StructuredDiscussions, WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (Anomie) >>! In T107610#5454434, @Catrope wrote: > It doesn't look like `ExternalStoreDB` currently supports over...
[15:55:33] Operations: Integrate Stretch 9.10/9.11 point updates - https://phabricator.wikimedia.org/T232308 (MoritzMuehlenhoff) [15:56:39] !log installing icu updates from stretch 9.11 point release [15:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:13] (CR) Alexandros Kosiaris: [C: +2] Add IPv6 calico rules for eventgate-logging-external -> kafka [deployment-charts] - https://gerrit.wikimedia.org/r/554295 (https://phabricator.wikimedia.org/T236386) (owner: Ottomata) [15:59:20] (PS3) Alexandros Kosiaris: Add IPv6 calico rules for eventgate-logging-external -> kafka [deployment-charts] - https://gerrit.wikimedia.org/r/554295 (https://phabricator.wikimedia.org/T236386) (owner: Ottomata) [15:59:43] (CR) Giuseppe Lavagetto: [C: +2] Document workaround for certificate issue on macOS [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/555751 (owner: Kosta Harlan) [16:00:24] (PS1) TechneSiyam: Modified IS.php with extra hd logos [mediawiki-config] - https://gerrit.wikimedia.org/r/556214 [16:00:35] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [16:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:48] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
[16:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:36] (PS4) CRusnov: netbox: Add automation git machinery [puppet] - https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) [16:04:02] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [16:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:32] (CR) jerkins-bot: [V: -1] netbox: Add automation git machinery [puppet] - https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) (owner: CRusnov) [16:04:46] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [16:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:19] (PS5) CRusnov: netbox: Add automation git machinery [puppet] - https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) [16:08:24] (CR) jerkins-bot: [V: -1] netbox: Add automation git machinery [puppet] - https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) (owner: CRusnov) [16:09:25] (PS2) Alexandros Kosiaris: k8s: Introduce kubetcd[12]00[456], kubestagetcd100[456] [puppet] - https://gerrit.wikimedia.org/r/556202 (https://phabricator.wikimedia.org/T239838) [16:09:50] Reedy: heads-up, I'm tandem deploying with Andrew-WMDE.
[16:11:13] (03PS1) 10Phamhi: wmcs: fix hiera lookup for primary labmon host [puppet] - 10https://gerrit.wikimedia.org/r/556215 (https://phabricator.wikimedia.org/T224585) [16:11:35] (03PS1) 10Bstorm: Revert "comment out sagres.c3sl.ufpr.br from dumps mirrors list" [puppet] - 10https://gerrit.wikimedia.org/r/556216 [16:11:50] !log installing gettext updates from stretch 9.11 point release [16:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:59] (03PS2) 10Bstorm: Revert "comment out sagres.c3sl.ufpr.br from dumps mirrors list" [puppet] - 10https://gerrit.wikimedia.org/r/556216 [16:12:03] (03PS6) 10CRusnov: netbox: Add automation git machinery [puppet] - 10https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) [16:14:01] (03CR) 10jerkins-bot: [V: 04-1] netbox: Add automation git machinery [puppet] - 10https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) (owner: 10CRusnov) [16:22:39] (03PS2) 10Phamhi: wmcs: fix hiera lookup for primary labmon host [puppet] - 10https://gerrit.wikimedia.org/r/556215 (https://phabricator.wikimedia.org/T224585) [16:23:05] !log cr[12]-codfw: Adding static route for 208.80.153.254 (legacy lvs recdns IP) to dns2002.wikimedia.org - T239993 [16:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:11] T239993: Decom LVS recdns - https://phabricator.wikimedia.org/T239993 [16:24:46] (03PS1) 10Ema: ATS: lookup cache for cookie requests [puppet] - 10https://gerrit.wikimedia.org/r/556217 (https://phabricator.wikimedia.org/T227432) [16:25:23] !log cr[12]-eqiad: Adding static route for 208.80.154.254 (legacy lvs recdns IP) to dns1002.wikimedia.org - T239993 [16:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:10] 10Operations: Integrate Stretch 9.10/9.11 point updates - https://phabricator.wikimedia.org/T232308 (10MoritzMuehlenhoff) [16:28:22] (03CR) 10BBlack: [C: 03+1] ATS: lookup 
cache for cookie requests [puppet] - 10https://gerrit.wikimedia.org/r/556217 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [16:28:26] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "My suggestion is to keep the default to localhost (so this can be applied to a VM for testing) and introduce a proper hiera config for the" [puppet] - 10https://gerrit.wikimedia.org/r/556215 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [16:31:42] !log andrew-wmde@deploy1001 Synchronized php-1.35.0-wmf.8/extensions/Cite: SWAT: [[gerrit:556186|Fix incomplete cloning of the Parser::$extCite instance (T240248)]] (duration: 01m 04s) [16:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:51] T240248: "PHP Notice: Undefined index: key" and similar in Cite.php and ReferenceStack.php - https://phabricator.wikimedia.org/T240248 [16:33:02] 10Operations, 10Traffic: Start warning and deprecation process for all legacy TLS - https://phabricator.wikimedia.org/T238038 (10TheDJ) @Ahecht this check doesn't care about browsers, because their behavior is not consistent. It only cares about which ACTUAL protocol you are using. Doing user-agents checks for... [16:33:50] (03CR) 10Bstorm: [C: 04-1] "I spoke too soon. They use ipv6 for this, and that isn't resolving still." 
[puppet] - 10https://gerrit.wikimedia.org/r/556216 (owner: 10Bstorm) [16:34:48] awight: Looks like it's quietened the logs down quite a bit [16:34:58] But still some showing a couple of minutes later [16:35:18] /w/index.php?title=Cut_(Hunters_and_Collectors_album)&action=submit TypeError from line 353 of /srv/mediawiki/php-1.35.0-wmf.8/extensions/Cite/src/ReferenceStack.php: Return value of Cite\ReferenceStack::getGroupRefs() must be of the type array, null returned [16:35:22] Reedy: I'm hoping those are pending edits and api calls [16:35:28] /w/index.php?title=Cut_(Hunters_and_Collectors_album)&action=submit ErrorException from line 353 of /srv/mediawiki/php-1.35.0-wmf.8/extensions/Cite/src/ReferenceStack.php: PHP Notice: Undefined index: [16:35:34] We'll see in a few more minutes? [16:35:44] Indeed [16:35:59] They're certainly not spamming the logs, but could've been transient or mid request etc [16:37:25] 10Operations, 10ops-eqiad, 10Discovery, 10Discovery-Search (Current work): Degraded RAID on cloudelastic1002 - https://phabricator.wikimedia.org/T239957 (10jcrespo) p:05Triage→03Normal a:03Mathew.onipe Assigning to Mathew based on above update as part of clinic duty. Feel free to revert if this is wr... 
[16:37:40] !log lvs* + dns*: puppet disabled for lvs recdns decom work - T239993 [16:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:45] T239993: Decom LVS recdns - https://phabricator.wikimedia.org/T239993 [16:38:04] awight: seemingly not [16:38:05] https://en.wikipedia.org/wiki/Cut_(Hunters_and_Collectors_album) [16:38:09] If you do a null edit on that page [16:38:18] "Return value of Cite\ReferenceStack::getGroupRefs() must be of the type array, null returned" [16:38:24] (CR) BBlack: [C: +2] lvs recdns: decom lvs-specific parts [puppet] - https://gerrit.wikimedia.org/r/555537 (https://phabricator.wikimedia.org/T239993) (owner: BBlack) [16:38:31] (CR) BBlack: [C: +2] lvs recdns: eqiad and codfw keep old addr, for now [puppet] - https://gerrit.wikimedia.org/r/556177 (https://phabricator.wikimedia.org/T239993) (owner: BBlack) [16:38:35] And an undefined index [16:38:43] Definitely better, but not completely fixed [16:38:50] Reedy: that is... quite exciting [16:39:15] (PS3) Phamhi: wmcs: fix hiera lookup for primary labmon host [puppet] - https://gerrit.wikimedia.org/r/556215 (https://phabricator.wikimedia.org/T224585) [16:39:16] https://en.wikipedia.org/wiki/Special:Contributions/Trappist_the_monk has been busy [16:39:46] ALSO [16:39:46] https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#fatal_error [16:39:54] >If I remove {{reflist}} then I can preview the section. Previewing the whole page does not cause the error.
[16:41:23] !log lvs400[67] - restarting pybal on high-traffic2 + backup, cleaning old entries for recdns [16:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:14] (03PS4) 10Phamhi: wmcs: fix hiera lookup for primary labmon host [puppet] - 10https://gerrit.wikimedia.org/r/556215 (https://phabricator.wikimedia.org/T224585) [16:45:28] (03CR) 10Elukey: [C: 03+1] Complete package list for slice pinning on stat/notebook [puppet] - 10https://gerrit.wikimedia.org/r/556213 (owner: 10Muehlenhoff) [16:45:40] Reedy: we'll try one more hotfix, this is trivial compared to the original issue... [16:45:47] heh [16:45:58] famous last words [16:46:04] !log lvs300[67] - restarting pybal on high-traffic2 + backup, cleaning old entries for recdns [16:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:21] What can possibly go wrong. [16:46:43] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "I would elaborate a bit more on the commit message about what is this doing. Other than that, this LGTM. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/556215 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [16:47:27] (03CR) 10EBernhardson: Deploy analytics-search keytab to an-airflow (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/556072 (owner: 10EBernhardson) [16:47:42] (03CR) 10Phamhi: [C: 03+2] wmcs: fix hiera lookup for primary labmon host [puppet] - 10https://gerrit.wikimedia.org/r/556215 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [16:49:28] (03PS1) 10Volans: docstrings: fix pep257 reported errors [software/spicerack] - 10https://gerrit.wikimedia.org/r/556219 [16:49:30] (03PS1) 10Volans: dnsdisc: use port 5353 to query the resolvers [software/spicerack] - 10https://gerrit.wikimedia.org/r/556220 [16:49:43] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "thanks! this LGTM. See comment below." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/556000 (https://phabricator.wikimedia.org/T239917) (owner: 10Jhedden) [16:50:00] !log lvs500[23] - restarting pybal on high-traffic2 + backup, cleaning old entries for recdns [16:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:53] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] aptrepo: add ceph nautilus repo for cloudvps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/556000 (https://phabricator.wikimedia.org/T239917) (owner: 10Jhedden) [16:53:15] (03PS2) 10Volans: dns: allow to specify a custom port [software/spicerack] - 10https://gerrit.wikimedia.org/r/556220 [16:53:17] (03PS1) 10Volans: dnsdisc: use port 5353 to query the resolvers [software/spicerack] - 10https://gerrit.wikimedia.org/r/556222 [16:54:49] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Add accraze to analytics-privatedata-users - https://phabricator.wikimedia.org/T240243 (10Nuria) Approved on my end [16:55:31] (03PS1) 10CDanis: dbctl: generate externalLoads [software/conftool] - 10https://gerrit.wikimedia.org/r/556224 (https://phabricator.wikimedia.org/T229686) [16:56:21] (03PS7) 10CRusnov: netbox: Add automation git machinery [puppet] - 10https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) [16:57:45] (03PS5) 10Jhedden: aptrepo: add ceph nautilus repo for cloudvps [puppet] - 10https://gerrit.wikimedia.org/r/556000 (https://phabricator.wikimedia.org/T239917) [16:58:15] (03CR) 10jerkins-bot: [V: 04-1] netbox: Add automation git machinery [puppet] - 10https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) (owner: 10CRusnov) [16:58:33] (03CR) 10Jhedden: aptrepo: add ceph nautilus repo for cloudvps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/556000 (https://phabricator.wikimedia.org/T239917) (owner: 10Jhedden) [16:59:36] (03CR) 10CRusnov: [C: 03+1] dns: generate DNS snippets from Netbox (031 comment) 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/554543 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [16:59:43] (03PS6) 10Jhedden: aptrepo: add ceph nautilus repo for cloudvps [puppet] - 10https://gerrit.wikimedia.org/r/556000 (https://phabricator.wikimedia.org/T239917) [17:00:04] godog and _joe_: It is that lovely time of the day again! You are hereby commanded to deploy Puppet SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191210T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:00:47] !log lvs200[25] - restarting pybal on high-traffic2 + backup, cleaning old entries for recdns [17:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:19] (03PS1) 10Phamhi: wmcs: fix hiera lookup for primary labmon host (typo correction) [puppet] - 10https://gerrit.wikimedia.org/r/556225 (https://phabricator.wikimedia.org/T224585) [17:01:37] (03PS8) 10CRusnov: netbox: Add automation git machinery [puppet] - 10https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) [17:03:38] (03CR) 10jerkins-bot: [V: 04-1] netbox: Add automation git machinery [puppet] - 10https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) (owner: 10CRusnov) [17:05:15] !log lvs100{14,16} - restarting pybal on high-traffic2 + backup, cleaning old entries for recdns [17:05:19] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:22] shorturl: 3https://w.wiki/Dd6 
[17:05:24] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Add accraze to analytics-privatedata-users - https://phabricator.wikimedia.org/T240243 (10jcrespo) a:05Nuria→03jcrespo [17:05:25] shorturl: 3https://w.wiki/Dd7 [17:05:38] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Add accraze to analytics-privatedata-users - https://phabricator.wikimedia.org/T240243 (10jcrespo) Thanks! [17:08:08] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/556222 (owner: 10Volans) [17:09:34] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/556220 (owner: 10Volans) [17:12:29] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:14:09] 10Operations, 10ops-eqiad, 10User-fgiunchedi: (Need By 8/15/19) rack/setup/install ms-be105[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T237438 (10Cmjohnson) 05Open→03Resolved Thanks, fixed! [17:17:05] ok, so i believe T240316 lines up with timing of wmf.8 deployment to group1. i could use advice here. if it's genuinely a blocker, it's a blocker for wmf.8 since there's no reason to believe the regression wouldn't persist into wmf.10. 
[17:17:05] (03CR) 10BBlack: [C: 03+2] lvs recdns: clean up realserver def [puppet] - 10https://gerrit.wikimedia.org/r/555538 (https://phabricator.wikimedia.org/T239993) (owner: 10BBlack) [17:17:07] T240316: Issue with QuickStatements "you are blocked on Wikidata" - https://phabricator.wikimedia.org/T240316 [17:17:16] (03PS6) 10BBlack: lvs recdns: clean up realserver def [puppet] - 10https://gerrit.wikimedia.org/r/555538 (https://phabricator.wikimedia.org/T239993) [17:17:57] i'm loath to roll this particular train back on tuesday a week after it was first supposed to see the light of day, and could use advice. cc: marxarelli, James_F, Amir1. [17:18:30] (03CR) 10Volans: "recheck" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/554543 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [17:19:31] brennen: Agreed. Leave it for now. [17:19:47] We'll treat it as a train blocker for wmf.10 for group1? [17:20:14] ^ works for me [17:20:39] makes sense on the theory that a fix ought to be forthcoming. [17:20:54] awight: Is that other patch being deployed? [17:21:15] the average render time for appservers (not api) rose to ~500ms [17:21:18] :( [17:21:22] on that logic, is there anything in awight's ongoing work with Cite that ought to be treated similarly? [17:21:29] seems at around 16:40 UTC [17:21:57] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET [17:22:19] I am in a meeting so can't investigate right now, is there anybody who can follow up?
[17:22:22] !log andrew-wmde@deploy1001 Synchronized php-1.35.0-wmf.8/extensions/Cite: SWAT: [[gerrit:556218|Catch one last undefined index (T240248)]] (duration: 01m 02s) [17:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:28] T240248: "PHP Notice: Undefined index: key" and similar in Cite.php and ReferenceStack.php - https://phabricator.wikimedia.org/T240248 [17:23:26] Cc: effie, _joe_ --^ [17:23:38] elukey: _joe_ and I are looking [17:24:46] slowness is correlated with APCu fragmentation, which spikes shortly after deploys: https://grafana.wikimedia.org/d/yK1IBFaZk/php7-apcu-usage-wip?orgId=1&from=1575964012936&to=1575998508936&var-datasource=eqiad%20prometheus%2Fops [17:25:12] *correlated with sharp increase in APCu fragmentation [17:25:48] Did the APCu defrag/restart script land? [17:25:48] (03CR) 10CRusnov: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/556219 (owner: 10Volans) [17:26:03] Reedy: brennen: Hi, sorry for the delay. Yes, we just deployed a final hotfix for wmf.8, which should have stopped all Cite errors. I'll look at this new blocker now. [17:26:30] (03CR) 10Volans: [C: 03+2] docstrings: fix pep257 reported errors [software/spicerack] - 10https://gerrit.wikimedia.org/r/556219 (owner: 10Volans) [17:26:36] (03CR) 10Volans: [C: 03+2] dnsdisc: use port 5353 to query the resolvers [software/spicerack] - 10https://gerrit.wikimedia.org/r/556222 (owner: 10Volans) [17:26:44] (03CR) 10Volans: [C: 03+2] dns: allow to specify a custom port [software/spicerack] - 10https://gerrit.wikimedia.org/r/556220 (owner: 10Volans) [17:26:54] awight: https://en.wikipedia.org/w/index.php?title=Cut_(Hunters_and_Collectors_album)&action=submit looks fixed [17:27:02] We deployed twice, which would have purged some caches. [17:27:31] Success? [17:28:02] Seems to be for me [17:28:10] James_F: I think so! I'm trying to understand if the new bug is anything related to our work, though. 
[17:28:20] fatal monitor seems to be timeouts and OOM [17:28:37] awight: The Wikidata API one shouldn't be affected by Cite? [17:29:08] (03PS9) 10CRusnov: netbox: Add automation git machinery [puppet] - 10https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) [17:29:24] James_F: yikes, I certainly hope not--but if there's any parsing happening inside of the API, then yes we would be liable. [17:30:26] (03Merged) 10jenkins-bot: docstrings: fix pep257 reported errors [software/spicerack] - 10https://gerrit.wikimedia.org/r/556219 (owner: 10Volans) [17:30:46] (03Merged) 10jenkins-bot: dnsdisc: use port 5353 to query the resolvers [software/spicerack] - 10https://gerrit.wikimedia.org/r/556222 (owner: 10Volans) [17:30:48] (03Merged) 10jenkins-bot: dns: allow to specify a custom port [software/spicerack] - 10https://gerrit.wikimedia.org/r/556220 (owner: 10Volans) [17:34:48] Good luck! o/ [17:35:41] 10Operations, 10ops-codfw: codfw: rack/setup/install mc203[7,8,9].codfw.wmnet - https://phabricator.wikimedia.org/T239249 (10elukey) @Papaul we might want to have a different naming convention for these hosts, please wait until I have a consensus before adding dns entries :) [17:44:14] !log mw1322$ php7adm /apcu-free [17:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:11] status update from _joe_ and I, who are working this out loud very impolitely, we did confirm this is affecting some appservers and not others, possibly as a matter of weighting, and we cleared the apcu cache on 1322 to verify that slowness stops [17:46:26] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs: fix hiera lookup for primary labmon host (typo correction) [puppet] - 10https://gerrit.wikimedia.org/r/556225 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [17:47:10] (03CR) 10Phamhi: [C: 03+2] wmcs: fix hiera lookup for primary labmon host (typo correction) [puppet] - 10https://gerrit.wikimedia.org/r/556225 
(https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [17:47:36] didn't have the effect we expected, still looking [17:48:01] fragmentation might have been a red herring, or at least shares a common cause with slowness rather than having caused it [17:48:32] <_joe_> !log depool mw1322 for debugging [17:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:08] (03CR) 10CDanis: [C: 03+1] monitoring: page on low HTTP global availability [puppet] - 10https://gerrit.wikimedia.org/r/555987 (https://phabricator.wikimedia.org/T186069) (owner: 10Filippo Giunchedi) [17:50:35] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] toolforge-kubernetes: disable profiling on api servers [puppet] - 10https://gerrit.wikimedia.org/r/555634 (https://phabricator.wikimedia.org/T240009) (owner: 10Bstorm) [17:50:46] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] toolforge-k8s: reduce the default terminated-pod-gc-threshold [puppet] - 10https://gerrit.wikimedia.org/r/555627 (https://phabricator.wikimedia.org/T240009) (owner: 10Bstorm) [17:53:59] (03PS7) 10Jhedden: aptrepo: add ceph nautilus repo for cloudvps [puppet] - 10https://gerrit.wikimedia.org/r/556000 (https://phabricator.wikimedia.org/T239917) [17:54:30] (03CR) 10Jhedden: "Thanks for catching that. 
Updated" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/556000 (https://phabricator.wikimedia.org/T239917) (owner: 10Jhedden) [17:54:30] <_joe_> !log repooled mw1322, just depooling solved the issue [17:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:24] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:56:47] (03PS2) 10BBlack: lvs recdns: get rid of legacy recursor hostnames [dns] - 10https://gerrit.wikimedia.org/r/555539 (https://phabricator.wikimedia.org/T239993) [17:56:49] (03PS1) 10BBlack: lvs recnds: remove last remaining revdns comments [dns] - 10https://gerrit.wikimedia.org/r/556230 (https://phabricator.wikimedia.org/T239993) [17:58:46] (03CR) 10BBlack: [C: 03+2] lvs recdns: get rid of legacy recursor hostnames [dns] - 10https://gerrit.wikimedia.org/r/555539 (https://phabricator.wikimedia.org/T239993) (owner: 10BBlack) [18:00:04] cscott, arlolra, subbu, halfak, and accraze: Your horoscope predicts another unfortunate Services – Graphoid / Parsoid / Citoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191210T1800). 
[18:01:58] (03PS8) 10Jhedden: aptrepo: add ceph nautilus repo for cloudvps [puppet] - 10https://gerrit.wikimedia.org/r/556000 (https://phabricator.wikimedia.org/T239917) [18:02:44] (03CR) 10Jhedden: [C: 03+2] aptrepo: add ceph nautilus repo for cloudvps [puppet] - 10https://gerrit.wikimedia.org/r/556000 (https://phabricator.wikimedia.org/T239917) (owner: 10Jhedden) [18:05:33] bblack or whoever might know, I am seeing a behavior on the dumps servers that I don't fully understand for DNS requests. When I look up an external IPv6 address, it sometimes completes quickly and other times is timing out after 3 seconds. [18:06:02] reproducible sporadically with `dig AAAA @10.3.0.1 sagres.c3sl.ufpr.br` [18:06:50] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:07:27] apergos: ^^ this obviously relates to that mirror [18:07:41] and better explains what I'm seeing [18:08:19] !log restarting php7.2-fpm on all remaining slow hosts except 1328, held back for investigation: mw[1333,1331,1322,1327,1325] [18:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:36] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:09:02] ah yeah [18:09:09] I have been following along on the ticket [18:09:30] !log imported ceph nautilus debian packages into buster-wikimedia/thirdparty/ceph-nautilus-buster T239917 [18:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:35] T239917: Import Buster packages for Ceph Nautilus - https://phabricator.wikimedia.org/T239917 [18:09:38] how sporadic is it, if i try a few times in 5 minutes will I hit it?
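One way to put a number on "how sporadic" is to repeat the lookup quoted above and tally the slow or unanswered runs. The following is only a minimal sketch, assuming `dig` is installed on the querying host. The helper names (`parse_query_time_ms`, `summarize`, `run_dig`) and the 1000 ms "slow" threshold are illustrative assumptions, not part of any ops tooling mentioned in this log.

```python
import re
import subprocess

# dig prints a line of the form ";; Query time: 162 msec" when it gets a reply.
QUERY_TIME_RE = re.compile(r"^;; Query time: (\d+) msec", re.MULTILINE)


def parse_query_time_ms(dig_output):
    """Return the reported query time in ms, or None if dig got no reply."""
    match = QUERY_TIME_RE.search(dig_output)
    return int(match.group(1)) if match else None


def summarize(times_ms, slow_threshold_ms=1000):
    """Tally runs; slow_threshold_ms is an arbitrary, hypothetical cutoff."""
    answered = [t for t in times_ms if t is not None]
    slow = [t for t in answered if t >= slow_threshold_ms]
    return {
        "runs": len(times_ms),
        "no_answer": len(times_ms) - len(answered),
        "slow": len(slow),
        "max_ms": max(answered, default=None),
    }


def run_dig(name, server, rrtype="AAAA"):
    """Run one lookup against the given resolver and return its query time."""
    proc = subprocess.run(
        ["dig", rrtype, f"@{server}", name],
        capture_output=True,
        text=True,
        check=False,  # a failed lookup is data here, not an error
    )
    return parse_query_time_ms(proc.stdout)
```

Something like `summarize([run_dig("sagres.c3sl.ufpr.br", "10.3.0.1") for _ in range(10)])` would then show how many of ten attempts were slow; only the parsing and tallying are exercised without network access.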
[18:10:50] 10Operations, 10Traffic, 10Patch-For-Review: Decom LVS recdns - https://phabricator.wikimedia.org/T239993 (10BBlack) Status: The actual LVS portion of this is now completely removed globally. The IP addresses themselves are also completely unconfigured and removed from service at the all the edge sites, but... [18:10:53] and... if you try some other nameserver I guess it's just hunky-dory, bstorm_? [18:11:15] Yes [18:11:21] At least so far :) [18:11:46] wonderful :-/ [18:12:10] (03PS4) 10BBlack: lvs recdns: remove legacy IP definition, step 1 [puppet] - 10https://gerrit.wikimedia.org/r/556178 (https://phabricator.wikimedia.org/T239993) [18:12:12] (03PS4) 10BBlack: lvs recdns: remove legacy IP definition, step 2 [puppet] - 10https://gerrit.wikimedia.org/r/556179 (https://phabricator.wikimedia.org/T239993) [18:12:26] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:12:54] apergos: I just managed to get it to work fine against our DNS once and do the timeout fail the next run. It's rather sporadic and unpredictable from what I can tell [18:13:09] a fail `;; Query time: 3174 msec` [18:13:32] success, also against our server `;; Query time: 162 msec` [18:19:00] that's really bizarre [18:19:14] I wonder if there is anything in our logs for the failures [18:19:21] !log cp2007: restart traffic-manager.service, seems to have been left in a bad state? [18:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:31] do the packets even get there for the fails? [18:20:13] bstorm_: will take a peek at your example and see if I can figure things out [18:21:25] thanks. 
If it matters, my source host is labstore1006. Shouldn't matter? [18:21:50] appserver latency recovered fully after _joe_ restarted php7.2-fpm on mw1328, which was the last slow host -- we're going to call it resolved for now rather than dive further into gdb at 19:30 -- we'll follow up and see what we can do in terms of debuggability for next time [18:22:13] !log restarted php7.2-fpm on mw1328 [18:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:04] (03PS3) 10Phamhi: wmcs: make cloudmetrics1002 the primary instead of labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/554844 (https://phabricator.wikimedia.org/T224585) [18:25:05] bstorm_: the problem is on ufpr.br's end [18:25:23] their dns servers (or their network, whatever) are behaving very erratically and poorly [18:25:33] (03PS2) 10Jcrespo: admin: Add accraze to analytics-privadata-users [puppet] - 10https://gerrit.wikimedia.org/r/556168 (https://phabricator.wikimedia.org/T240243) [18:25:44] (03CR) 10Jcrespo: "I got distracted, will deploy this tomorrow morning." [puppet] - 10https://gerrit.wikimedia.org/r/556168 (https://phabricator.wikimedia.org/T240243) (owner: 10Jcrespo) [18:25:54] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp2007 is OK: HTTP OK: HTTP/1.0 200 OK - 22731 bytes in 0.114 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [18:26:04] I think I have yet to get a solid answer out of them for anything over IPv4. Sometimes I can get responses (with various delays) over IPv6 [18:26:22] huh [18:26:27] could be their DNS servers are misbehaving, could be they're currently being attacked and not much traffic can get through, not sure [18:26:32] Ok. They are apparently strictly v6 [18:26:53] Ours don't cache the response or anything? 
[18:27:07] Suppose that all makes sense then :) [18:27:15] our servers do the default things [18:27:26] they try to cache things as allowed by the remote auth servers [18:27:33] I see [18:27:36] but for unreachable remote auth servers, there's only so much they can do [18:27:44] (03CR) 10Herron: "I'll plan to go ahead with this in the morning tomorrow Eastern tz. This is now set to critical: false to avoid alert spam from this serv" [puppet] - 10https://gerrit.wikimedia.org/r/556036 (owner: 10Herron) [18:28:00] after that I guess we could investigate whether there's some powerdns config to limit the maximum time before answering our clients with a SERVFAIL [18:28:06] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp2007 is OK: HTTP OK: HTTP/1.1 200 Ok - 34311 bytes in 0.114 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [18:28:08] or whatever's making the query could time out on its own [18:28:57] this is about setting up ferm services, without that we have a problem :-) [18:29:04] (allowing them to rsync from us) [18:29:39] (03PS4) 10Phamhi: wmcs: make cloudmetrics1002 the primary instead of labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/554844 (https://phabricator.wikimedia.org/T224585) [18:29:55] bstorm_: looking at the powerdns docs, it seems like 1.5 seconds should be the default wait for a remove DNS server. Possibly up to x4 for the four IPs they advertise.... but still. [18:30:01] s/remove/remote/ [18:30:18] ok [18:30:49] I haven't seen any delays longer than some handful of seconds for 10.3.0.1 for these names [18:30:50] (03CR) 10Jforrester: "This should go ahead of the production (testwiki) one. Do you want it deployed now?"
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/555932 (https://phabricator.wikimedia.org/T235717) (owner: 10Kosta Harlan) [18:31:00] apergos: so basically if we add that back to ferm, we have a game of Russian roulette on the puppet run :) [18:31:08] and that's clearly no good [18:31:12] oh I guess you haven't either, I was misremembering from IRC above and thought you had longer timeouts [18:31:19] but ~3s, that seems pretty normal for this scenario [18:31:22] Once ferm is up, it would likely be fine, but... [18:32:07] maybe they can look at their end? [18:32:08] So I'm not sure what to tell the mirror operator [18:32:12] so the problem is this is affecting an @resolve() for ferm? [18:32:13] Yeah [18:32:17] yes [18:32:18] exactly [18:32:47] in the short term: I'd unconfigure it for now just to move forward. Their dns/network is broken. Can revert-patch it back into your config later. [18:32:58] It's unconfigured for now :) [18:33:01] yeah [18:33:15] so just need a path forward for them [18:33:18] I've "-1"d my own revert patch [18:33:25] they sent an email asking why they couldn't rsync from us... [18:33:28] Yeah [18:33:41] tell them their DNS and/or network appears to be flaky/broken. [18:33:49] and our recursors can't reliably resolve them [18:33:54] I've been typing up an email to that effect [18:34:05] feel free to cc ops-dumps on it [18:34:13] hopefully the alias stays in the cc [18:34:18] in the longer-term: ipresolve() in puppet and @resolve() in ferm are both awful things... [18:34:41] :) [18:34:43] I mean I get it, and it's hard to design something better.
But especially for hostnames we don't even own/resolve internally, it's awful [18:34:59] we should put some time into replacing those with something better [18:35:20] (03PS1) 10Mforns: analytics::refinery::job::data_purge: Add growth deletion timers [puppet] - 10https://gerrit.wikimedia.org/r/556232 (https://phabricator.wikimedia.org/T237124) [18:35:21] we can open a ticket to go into your list of 400 tasks most of which you will never get to :-P [18:35:41] when I said "we", I meant the kind of "we" that doesn't include "me" :) [18:35:47] ahahahahaha [18:36:17] 😁 [18:37:20] email away. [18:37:25] Thanks bblack and apergos [18:37:33] thanks b black and b storm! [18:39:07] (03PS2) 10Mforns: analytics::refinery::job::data_purge: Add growth deletion timers [puppet] - 10https://gerrit.wikimedia.org/r/556232 (https://phabricator.wikimedia.org/T237124) [18:39:48] brennen: i see your SAL entries for branch.py failure. did you ever get that working for the wmf branch cut? [18:39:55] James_F: ^ fyi [18:40:41] marxarelli: i did not, but i think that's because i was using the unpatched version. [18:40:52] i see [18:41:02] maybe i'll give it a go for wmf.10 [18:41:11] i think it's worth a shot. [18:42:09] one second, digging in bash history... [18:43:29] marxarelli: this is the invocation i attempted, per twentyafterfour: ./branch.py --core --core-bundle wmf_core --bundle wmf_branch --branchpoint HEAD --core-version 1.35.0-wmf.8 wmf/1.35.0-wmf.8 [18:44:00] we've since merged the patch [18:44:44] brennen, twentyafterfour: i'll give it a whirl [18:45:07] marxarelli: let me know if you have trouble and try to paste a log somewhere if anything does go wrong [18:45:28] twentyafterfour: will do. thanks! 
[18:46:31] https://doc.powerdns.com/recursor/settings.html#server-down-max-fails [18:46:40] ^ potentially this could be tuned to reduce the impact of such cases [18:47:03] right now a server has to fail like 64 times in a row before the recursor just calls it dead for a while (a while being 60s by default) [18:47:40] but it's hard to tune things like that and not create as many problems as you solve for other cases [18:48:00] still, 64 in a row is a lot of failure, and 60s isn't much hold time [18:48:08] it sure is (a lot of failures) [18:48:55] maybe if it were more like 10 fails -> 300s hold, maybe a case like this might return enough SERVFAILS fast enough that things don't completely break on the ferm end of it [18:49:11] but then again, I guess SERVFAIL might also break the ferm run anyways [18:49:52] (03PS1) 10Ammarpad: Add 2020: Wikimania namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556234 (https://phabricator.wikimedia.org/T240339) [18:50:54] (03PS2) 10Ammarpad: Add 2020: Wikimania namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556234 (https://phabricator.wikimedia.org/T240339) [18:52:07] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=205 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:54:25] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=205 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:56:20] I expect it might [18:56:24] (break the ferm run) [18:56:33] well we'll see what they come up with [18:57:32] James_F: thinking i'll start the branch cut now and backport any additional fixes that merge for T240316 and T240345 [18:57:32] T240316: Issue with QuickStatements "you are blocked on Wikidata" - https://phabricator.wikimedia.org/T240316 [18:57:32] T240345: {{#expr: expression }} breaks references in the whole article if expression throws an error - https://phabricator.wikimedia.org/T240345 [18:57:44] (03PS2) 10Phamhi: cloudvps: rename+reimage labmon1001 as cloudmetrics1001 [dns] - 10https://gerrit.wikimedia.org/r/555570 (https://phabricator.wikimedia.org/T224585) [18:57:44] marxarelli: Sure. [18:58:00] (03PS2) 10Phamhi: cloudvps: rename+reimage labmon1001 as cloudmetrics1001 [puppet] - 10https://gerrit.wikimedia.org/r/555565 (https://phabricator.wikimedia.org/T224585) [18:58:03] marxarelli: I just added the branch.py instructions to https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Create_the_new_branch_in_Gerrit BTW. [18:58:54] cool, thanks [18:59:21] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191210T1900) [19:00:37] bstorm_: we have an answer already [19:00:42] (03PS4) 10Mforns: Remove all references to Wikimetrics [puppet] - 10https://gerrit.wikimedia.org/r/499304 (https://phabricator.wikimedia.org/T211835) [19:01:11] saying we are 'too vague' [19:01:19] !log cutting branch for 1.35.0-wmf.10 cc: T233858 [19:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:25] T233858: 1.35.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T233858 [19:01:54] bblack: care to copy-paste some grody details someplace that can be slapped into an email? [19:02:37] Yup! I replied, but yeah, more details from bblack will help. [19:18:18] (03CR) 10Arlolra: [C: 04-1] "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556001 (https://phabricator.wikimedia.org/T237326) (owner: 10Arlolra) [19:18:46] (03PS1) 10CDanis: WIP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556236 [19:21:38] (03PS3) 10Herron: ganeti: assign ganeti400[123] role::ganeti [puppet] - 10https://gerrit.wikimedia.org/r/555761 (https://phabricator.wikimedia.org/T226444) [19:24:19] twentyafterfour: branch.py seems to have skipped branch creation for core and then failed on the push to gerrit for review [19:25:23] (03PS4) 10Herron: ganeti: assign ganeti400[123] role::ganeti [puppet] - 10https://gerrit.wikimedia.org/r/555761 (https://phabricator.wikimedia.org/T226444) [19:33:09] PROBLEM - Host cp3055 is DOWN: PING CRITICAL - Packet loss = 100% [19:37:57] twentyafterfour: i think i see why. 
https://gerrit.wikimedia.org/r/c/mediawiki/tools/release/+/543248/19/make-release/mwrelease/branch.py#b186 removed the call to `create_branch('mediawiki/core', ...` [19:39:45] i'll have to fall back to make-wmf-branch for the time being i think [19:40:02] :-( [19:40:14] or create the mediawiki/core branch manually and then re-run branch.py [19:41:08] however, i can't think of a reason not to restore the explicit call to create_branch prior to do_core_work [19:41:37] James_F/twentyafterfour ^ ? [19:42:06] I'm not across the code enough to be certain, but I think you're right. [19:45:40] <_joe_> !log restarting php-fpm on mw1332,1319 (high latency) [19:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:14] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:52:08] James_F: k. patched the script and got it to work https://gerrit.wikimedia.org/r/c/mediawiki/core/+/556241 [19:52:35] marxarelli: Nice. [19:52:55] marxarelli: But DefaultSettings.php isn't patched? [19:53:18] grr [19:53:21] should have been [19:53:49] https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/556241/1/includes/DefaultSettings.php [19:54:09] Oh, right, I'm just blind. [19:54:13] Hurrah. Let's rock. [20:00:04] marxarelli and James_F: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Mediawiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191210T2000). [20:00:20] marxarelli: Apparently we're up, who knew?
[20:00:49] :) [20:02:19] marxarelli: Presumably we need the wmf.10 branch to be merged before we can do things with it. ;-) [20:05:29] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:06:04] 10Operations, 10Release-Engineering-Team-TODO, 10observability, 10Release-Engineering-Team (Deployment services): "MediaWiki exceptions and fatals per minute" alarm is too slow (half an hour delay!) - https://phabricator.wikimedia.org/T141520 (10hashar) 05Resolved→03Open I explained it in the task, the... [20:06:13] James_F: with all the waiting on jenkins, i'm not sure the branch cut is overall faster with branch.py. though i guess once we're confident in it, we can have our automated process pass `--no-review` [20:06:37] Yeah. [20:06:46] There's still manual review before deploy. 
[20:11:32] (03Abandoned) 10Gergő Tisza: Use dblist for wmgUseGrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546895 (owner: 10Gergő Tisza) [20:15:24] (03PS4) 10Gergő Tisza: Add growthexperiments dblist, for puppet usage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546894 (https://phabricator.wikimedia.org/T208369) [20:18:15] (03PS5) 10Jhedden: ceph: initial monitor and mgr daemon modules [puppet] - 10https://gerrit.wikimedia.org/r/556017 (https://phabricator.wikimedia.org/T239918) [20:21:03] (03CR) 10Jhedden: [C: 03+2] ceph: initial monitor and mgr daemon modules [puppet] - 10https://gerrit.wikimedia.org/r/556017 (https://phabricator.wikimedia.org/T239918) (owner: 10Jhedden) [20:24:09] (03PS1) 10Dduvall: group0 to 1.35.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556244 [20:24:20] (03PS2) 10Gergő Tisza: mediawiki: maintenance script for purging old GrowthExperiments data [puppet] - 10https://gerrit.wikimedia.org/r/546896 (https://phabricator.wikimedia.org/T208369) [20:27:45] marxarelli: there is a reason not to call create_branch, because if you create the branch beforehand it causes the release notes automation to fail [20:28:33] James_F: looks like there's some cleanup to do. i'm in no rush today so i'll go ahead with the scap cleans [20:28:59] thcipriani: ^ [20:30:15] the way make-wmf-branch does it is that it creates the branch locally and pushes it the old fashioned way (instead of making the branch via rest api) and that's how branch.py is supposed to do it now as well [20:30:32] marxarelli: Sure. 
[20:30:36] twentyafterfour: it does it that way if you specify `--no-review` [20:31:00] hmm [20:31:03] otherwise it tries to submit the core changes to gerrit for review on a branch that doesn't exist [20:31:09] !log cdanis@cumin2001 conftool action : set/weight=20; selector: cluster=appserver,dc=eqiad,service=apache2,name=mw132[34].* [20:31:11] ohh [20:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:18] !log cdanis@cumin2001 conftool action : set/weight=20; selector: cluster=appserver,dc=eqiad,service=nginx,name=mw132[34].* [20:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:39] marxarelli: I see, that makes sense. I guess we need to make --no-review the only / default option then [20:31:53] that or change thcipriani' [20:32:01] thcipriani's release notes tool [20:32:47] is there an error if you submit a CR to create a branch? [20:33:32] might depend on permissions? [20:33:38] !log dduvall@deploy1001 Pruned MediaWiki: 1.35.0-wmf.4 (duration: 06m 40s) [20:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:09] hrm, or maybe you just can't have that. [20:35:59] thcipriani: I guess gerrit doesn't support creating a branch via code review push? [20:36:24] !log dduvall@deploy1001 Pruned MediaWiki: 1.35.0-wmf.3 (duration: 01m 52s) [20:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:40] as I think about it that probably makes sense, what are you expected to review if there's no prior state? [20:37:11] !log ✔️ cdanis@mw1323.eqiad.wmnet ~ 🕞🍵 sudo renice -n -19 `pidof mcrouter` [20:37:14] well. I guess we need to figure out a different trigger for the deploy-notes. 
[20:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:58] https://gerrit-review.googlesource.com/Documentation/error-branch-not-found.html [20:38:05] (PS1) Jhedden: add ceph fake keys [labs/private] - https://gerrit.wikimedia.org/r/556246 [20:38:09] !log dduvall@deploy1001 Pruned MediaWiki: 1.35.0-wmf.5 (duration: 01m 36s) [20:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:26] !log dduvall@deploy1001 Started scap: testwiki to php-1.35.0-wmf.10 and rebuild l10n cache [20:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:02] (CR) Jhedden: [V: +2 C: +2] add ceph fake keys [labs/private] - https://gerrit.wikimedia.org/r/556246 (owner: Jhedden) [20:41:21] (PS3) Gergő Tisza: mediawiki: maintenance script for purging old GrowthExperiments data [puppet] - https://gerrit.wikimedia.org/r/546896 (https://phabricator.wikimedia.org/T208369) [20:42:09] twentyafterfour: yep, guess we need to go back to using the gerrit api to create_branch for core and remove the zuul trigger for creating deployment notes. 1st step might be (1) manually triggering branch cut (2) confirm submodules patch +2 (3) manually trigger deploy notes. Later we can move to using no-review and just triggering one job from the other... 
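For reference, the "gerrit api to create_branch" route discussed above corresponds to Gerrit's Create Branch REST endpoint (PUT /projects/{project-name}/branches/{branch-id} with a "revision" field in the JSON body). The following is only a minimal sketch of how such a request could be assembled; it does not send anything, and it is not taken from branch.py. The helper name build_create_branch_request, the project name mediawiki/core, the branch name wmf/1.35.0-wmf.10 and the starting revision master are all assumptions made for illustration.

```python
import json
from urllib.parse import quote


def build_create_branch_request(base_url, project, branch, revision):
    """Return (method, url, body) for Gerrit's Create Branch endpoint.

    Hypothetical helper: it only builds the request, it does not send it.
    Project and branch names are URL-encoded because they contain slashes;
    the /a/ prefix marks an authenticated REST call in Gerrit.
    """
    url = "{}/a/projects/{}/branches/{}".format(
        base_url.rstrip("/"),
        quote(project, safe=""),
        quote(branch, safe=""),
    )
    # BranchInput: "revision" is the commit or ref the new branch starts from.
    body = json.dumps({"revision": revision})
    return "PUT", url, body


# Assumed example values, not confirmed by the log above.
method, url, body = build_create_branch_request(
    "https://gerrit.wikimedia.org/r",
    "mediawiki/core",
    "wmf/1.35.0-wmf.10",
    "master",
)
print(method, url)
print(body)
```

Creating the branch this way bypasses code review entirely, which matches the observation in the channel that a review push cannot target a branch that does not exist yet.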
[20:42:56] (Abandoned) Paladox: vagrant::mediawiki: Create /srv/mediawiki-vagrant/.vagrant/machines [puppet] - https://gerrit.wikimedia.org/r/406484 (https://phabricator.wikimedia.org/T180377) (owner: Paladox) [20:44:01] Operations, ops-esams, DC-Ops: Add missing labels for equipment and cables - https://phabricator.wikimedia.org/T237009 (wiki_willy) a:RobH [20:45:21] (PS1) Jhedden: ceph: add ceph monitor role to cloudcephmon servers [puppet] - https://gerrit.wikimedia.org/r/556247 (https://phabricator.wikimedia.org/T239918) [20:53:25] (PS2) Jhedden: ceph: add ceph monitor role to cloudcephmon servers [puppet] - https://gerrit.wikimedia.org/r/556247 (https://phabricator.wikimedia.org/T239918) [20:55:33] (PS3) Jhedden: ceph: add ceph monitor role to cloudcephmon servers [puppet] - https://gerrit.wikimedia.org/r/556247 (https://phabricator.wikimedia.org/T239918) [20:59:31] (PS4) Jhedden: ceph: add ceph monitor role to cloudcephmon servers [puppet] - https://gerrit.wikimedia.org/r/556247 (https://phabricator.wikimedia.org/T239918) [21:03:39] Operations, ops-esams: rack/setup/install ps[12]-oe1[456]-esams - https://phabricator.wikimedia.org/T184066 (wiki_willy) a:RobH [21:05:11] (PS1) Jhedden: update fake ceph mon secret paths [labs/private] - https://gerrit.wikimedia.org/r/556251 [21:05:45] (CR) Jhedden: [V: +2 C: +2] update fake ceph mon secret paths [labs/private] - https://gerrit.wikimedia.org/r/556251 (owner: Jhedden) [21:08:34] (CR) Jhedden: [C: +2] "puppet complier results: https://puppet-compiler.wmflabs.org/compiler1003/19894/" [puppet] - https://gerrit.wikimedia.org/r/556247 (https://phabricator.wikimedia.org/T239918) (owner: Jhedden) [21:11:10] (CR) Phamhi: [C: +1] ceph: add ceph monitor role to cloudcephmon servers [puppet] - https://gerrit.wikimedia.org/r/556247 (https://phabricator.wikimedia.org/T239918) (owner: Jhedden) [21:16:47] !log dduvall@deploy1001 Finished 
scap: testwiki to php-1.35.0-wmf.10 and rebuild l10n cache (duration: 37m 20s) [21:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:13] marxarelli: Clearly lots of i18n this week. [21:19:30] James_F: seems like it [21:20:07] alrighty. about to promote group0 1.35.0-wmf.10 [21:20:33] Cool. [21:20:51] (CR) Dduvall: [C: +2] group0 to 1.35.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/556244 (owner: Dduvall) [21:21:42] (Merged) jenkins-bot: group0 to 1.35.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/556244 (owner: Dduvall) [21:23:45] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.35.0-wmf.10 [21:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:15] !log promoted group0 to 1.35.0-wmf.10 cc: T233858 [21:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:20] T233858: 1.35.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T233858 [21:26:12] James_F: looks ok so far [21:26:29] +1 [21:32:34] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team-TODO, SRE-Access-Requests: Grant "contint-roots" and "releasers-mediawiki" to user brennen - https://phabricator.wikimedia.org/T240382 (Jdforrester-WMF) [21:34:25] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team-TODO, SRE-Access-Requests: Grant "contint-roots" and "releasers-mediawiki" to user brennen - https://phabricator.wikimedia.org/T240382 (brennen) For purposes of T239985, a Jenkins upgrade. 
[21:37:11] (PS5) Jbond: puppet-merge: add Repository class [puppet] - https://gerrit.wikimedia.org/r/544943 [21:38:38] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team-TODO, SRE-Access-Requests: Grant "contint-roots" and "releasers-mediawiki" to user brennen - https://phabricator.wikimedia.org/T240382 (hashar) [21:39:27] Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (Jclark-ctr) sent TSR report after running onboard diagnostics that had faults for memory and psu1 & psu2 . TSR report sho... [21:39:41] Operations, Puppet, Security, User-jbond: Add method to admin module ci to detect removed users - https://phabricator.wikimedia.org/T239070 (chasemp) [21:45:38] SF office network seems down? [21:48:50] (PS1) Jhedden: ceph: update mon ferm rules and manage ceph service account [puppet] - https://gerrit.wikimedia.org/r/556260 (https://phabricator.wikimedia.org/T239918) [21:48:55] Aha, and we're back. 
[21:50:51] (CR) Jhedden: [C: +2] ceph: update mon ferm rules and manage ceph service account [puppet] - https://gerrit.wikimedia.org/r/556260 (https://phabricator.wikimedia.org/T239918) (owner: Jhedden) [21:51:10] (CR) Jhedden: [C: +2] "PCC results: https://puppet-compiler.wmflabs.org/compiler1001/19896/" [puppet] - https://gerrit.wikimedia.org/r/556260 (https://phabricator.wikimedia.org/T239918) (owner: Jhedden) [21:54:02] (PS1) Paladox: Gerrit: Rename ssh_host_key to ssh_host_rsa_key [puppet] - https://gerrit.wikimedia.org/r/556265 [21:57:40] (PS2) Paladox: Gerrit: Rename ssh_host_key to ssh_host_rsa_key [puppet] - https://gerrit.wikimedia.org/r/556265 [21:58:04] (PS1) Paladox: Gerrit: Rename ssh_host_key to ssh_host_rsa_key [labs/private] - https://gerrit.wikimedia.org/r/556268 [21:58:31] (PS2) Paladox: Gerrit: Rename ssh_host_key to ssh_host_rsa_key [labs/private] - https://gerrit.wikimedia.org/r/556268 [21:59:25] (PS1) Paladox: Gerrit: Add ed25519 and ecdsa ssh host keys [puppet] - https://gerrit.wikimedia.org/r/556270 [22:00:18] (CR) Reedy: [C: -1] "Minus one to signify it needs a puppet private patch at the same time :)" [puppet] - https://gerrit.wikimedia.org/r/556265 (owner: Paladox) [22:01:39] (PS3) Paladox: Gerrit: Rename ssh_host_key to ssh_host_rsa_key [puppet] - https://gerrit.wikimedia.org/r/556265 [22:01:40] (CR) Paladox: "> Minus one to signify it needs a puppet private patch at the same" [puppet] - https://gerrit.wikimedia.org/r/556265 (owner: Paladox) [22:01:44] (PS2) Paladox: Gerrit: Add ed25519 and ecdsa ssh host keys [puppet] - https://gerrit.wikimedia.org/r/556270 [22:02:14] (CR) jerkins-bot: [V: -1] Gerrit: Rename ssh_host_key to ssh_host_rsa_key [puppet] - https://gerrit.wikimedia.org/r/556265 (owner: Paladox) [22:02:44] (PS4) Paladox: Gerrit: Rename ssh_host_key to ssh_host_rsa_key [puppet] - https://gerrit.wikimedia.org/r/556265 [22:02:57] 
(PS3) Paladox: Gerrit: Add ed25519 and ecdsa ssh host keys [puppet] - https://gerrit.wikimedia.org/r/556270 [22:03:44] (PS4) Paladox: Gerrit: Add ed25519 and ecdsa ssh host keys [puppet] - https://gerrit.wikimedia.org/r/556270 [22:04:08] (PS1) Paladox: Gerrit: Add ed25519 and ecdsa fake ssh host keys [labs/private] - https://gerrit.wikimedia.org/r/556271 [22:04:48] (PS5) Paladox: Gerrit: Rename ssh_host_key to ssh_host_rsa_key [puppet] - https://gerrit.wikimedia.org/r/556265 [22:04:58] (PS5) Paladox: Gerrit: Add ed25519 and ecdsa ssh host keys [puppet] - https://gerrit.wikimedia.org/r/556270 [22:05:24] (CR) jerkins-bot: [V: -1] Gerrit: Rename ssh_host_key to ssh_host_rsa_key [puppet] - https://gerrit.wikimedia.org/r/556265 (owner: Paladox) [22:07:05] (PS2) Paladox: Gerrit: Add ed25519 and ecdsa fake ssh host keys [labs/private] - https://gerrit.wikimedia.org/r/556271 [22:08:40] (CR) Jforrester: [C: +1] "Let's land this?" [mediawiki-config] - https://gerrit.wikimedia.org/r/546894 (https://phabricator.wikimedia.org/T208369) (owner: Gergő Tisza) [22:09:21] (PS6) Paladox: Gerrit: Rename ssh_host_key to ssh_host_rsa_key [puppet] - https://gerrit.wikimedia.org/r/556265 [22:09:33] (PS6) Paladox: Gerrit: Add ed25519 and ecdsa ssh host keys [puppet] - https://gerrit.wikimedia.org/r/556270 [22:09:50] (PS7) Paladox: Gerrit: Add ed25519 and ecdsa ssh host keys [puppet] - https://gerrit.wikimedia.org/r/556270 [22:12:03] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=ulsfo https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:13:41] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:17:02] (PS1) Jhedden: ceph: require service account before package install [puppet] - https://gerrit.wikimedia.org/r/556274 (https://phabricator.wikimedia.org/T239918) [22:17:27] (CR) Reedy: Gerrit: Add ed25519 and ecdsa ssh host keys (1 comment) [puppet] - https://gerrit.wikimedia.org/r/556270 (owner: Paladox) [22:18:13] (CR) Jhedden: [C: +2] ceph: require service account before package install [puppet] - https://gerrit.wikimedia.org/r/556274 (https://phabricator.wikimedia.org/T239918) (owner: Jhedden) [22:19:22] (CR) Paladox: Gerrit: Add ed25519 and ecdsa ssh host keys (1 comment) [puppet] - https://gerrit.wikimedia.org/r/556270 (owner: Paladox) [22:23:13] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [22:24:23] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [22:27:59] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@7c8cb9d]: Update mobileapps to 3b1ba07 [22:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:32] (CR) CDanis: [C: +1] "looks good, just one nit" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/544214 (owner: Jbond) [22:32:36] I'm getting db locked for Special:UploadWizard on commons: https://commons.wikimedia.org/wiki/Special:UploadWizard [22:32:48] non stop, for minutes now [22:33:09] ok now [22:33:30] https://commons.wikimedia.org/w/api.php?format=xml&action=query&meta=siteinfo&siprop=dbrepllag&sishowalldb=1 [22:33:32] There seems to be lag [22:33:57] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@7c8cb9d]: Update mobileapps to 3b1ba07 (duration: 05m 58s) [22:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:38] there is a _lot_ of read traffic on s4 right now [22:35:01] (PS17) Jbond: puppet-merge: refactor [puppet] - https://gerrit.wikimedia.org/r/544214 [22:35:02] not sure where it is coming from, but, baseline is something like 200k rps, but it's getting 6-7M rps right now [22:35:47] * Amir1 stands still holding his axe [22:35:59] we should investigate [22:36:08] (CR) Jbond: "thanks updated" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/544214 (owner: Jbond) [22:36:59] (PS6) Jbond: puppet-merge: add Repository class [puppet] - https://gerrit.wikimedia.org/r/544943 [22:45:31] (PS1) Jhedden: ceph: add ceph object store role to cloudcephosd servers [puppet] - https://gerrit.wikimedia.org/r/556279 (https://phabricator.wikimedia.org/T239918) [22:50:06] (PS1) Krinkle: admin: 
change matrix.php column "grp" to "groups" [puppet] - https://gerrit.wikimedia.org/r/556281 [22:50:08] (CR) Volans: "recheck" [software/netbox-extras] - https://gerrit.wikimedia.org/r/554543 (https://phabricator.wikimedia.org/T233183) (owner: Volans) [22:53:06] (CR) Gergő Tisza: "It's scheduled for the next SWAT window." [mediawiki-config] - https://gerrit.wikimedia.org/r/546894 (https://phabricator.wikimedia.org/T208369) (owner: Gergő Tisza) [22:55:39] (PS1) Krinkle: mediawiki: Remove unused HHVM files [puppet] - https://gerrit.wikimedia.org/r/556282 (https://phabricator.wikimedia.org/T229792) [23:11:24] (PS2) Jhedden: ceph: add ceph object store role to cloudcephosd servers [puppet] - https://gerrit.wikimedia.org/r/556279 (https://phabricator.wikimedia.org/T239918) [23:16:02] (CR) Volans: [C: +2] frack: fix asset tag management records [dns] - https://gerrit.wikimedia.org/r/554079 (https://phabricator.wikimedia.org/T239597) (owner: Volans) [23:16:07] (PS5) Volans: frack: fix asset tag management records [dns] - https://gerrit.wikimedia.org/r/554079 (https://phabricator.wikimedia.org/T239597) [23:19:22] (CR) Volans: [C: +2] dns: generate DNS snippets from Netbox [software/netbox-extras] - https://gerrit.wikimedia.org/r/554543 (https://phabricator.wikimedia.org/T233183) (owner: Volans) [23:20:15] (PS1) Jhedden: add fake keys for ceph osd profile [labs/private] - https://gerrit.wikimedia.org/r/556290 [23:20:33] (CR) Jhedden: [V: +2 C: +2] add fake keys for ceph osd profile [labs/private] - https://gerrit.wikimedia.org/r/556290 (owner: Jhedden) [23:24:06] (CR) Jhedden: [C: +2] "PCC results: https://puppet-compiler.wmflabs.org/compiler1003/19898/" [puppet] - https://gerrit.wikimedia.org/r/556279 (https://phabricator.wikimedia.org/T239918) (owner: Jhedden)