[00:42:21] Operations, Traffic: /sec-warning page: please add a helpful XML comment explaining why it's being delivered. - https://phabricator.wikimedia.org/T240794 (Krenair) I'd prefer we talked about the drawbacks of serving the /sec-warning HTML with different HTTP status codes, than doing anything special in th...
[00:48:49] PROBLEM - Check systemd state on ms-be2033 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:19:19] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[01:39:23] RECOVERY - Check systemd state on ms-be2033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:20:25] PROBLEM - puppet last run on ms-be2016 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[02:22:29] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert, rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[03:24:25] Operations, Traffic: Start warning and deprecation process for all legacy TLS - https://phabricator.wikimedia.org/T238038 (Krenair) Speaking of stats - though on TLS versions rather than ciphers - do we have numbers on how many connections/requests/users were using TLS 1.0/1.1?
[03:27:32] Operations, Traffic: HTTPS/Browser Recommendations page on Wikitech is outdated - https://phabricator.wikimedia.org/T240813 (Emufarmers)
[03:32:40] Operations, Traffic, Documentation: Update TLS/HTTP documentation on wikitech - https://phabricator.wikimedia.org/T96844 (AntiCompositeNumber) This task has been open since 2015 and the Wikitech HTTPS documentation is still in need of improvement. While the HTTPS pages on Wikitech have been updated s...
[03:36:46] (CR) Tim Starling: [C: +2] keys.txt: Only include Tim's current key (73F146FECF9D333C) [mediawiki-config] - https://gerrit.wikimedia.org/r/557158 (owner: Legoktm)
[03:36:56] (CR) Tim Starling: [C: +2] keys.html: Include Tim's new key (73F146FECF9D333C) [mediawiki-config] - https://gerrit.wikimedia.org/r/557159 (owner: Legoktm)
[03:37:41] (Merged) jenkins-bot: keys.txt: Only include Tim's current key (73F146FECF9D333C) [mediawiki-config] - https://gerrit.wikimedia.org/r/557158 (owner: Legoktm)
[03:37:50] (Merged) jenkins-bot: keys.html: Include Tim's new key (73F146FECF9D333C) [mediawiki-config] - https://gerrit.wikimedia.org/r/557159 (owner: Legoktm)
[03:49:12] !log tstarling@deploy1001 Synchronized docroot/mediawiki.org/keys/keys.txt: (no justification provided) (duration: 01m 01s)
[03:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:52:06] !log tstarling@deploy1001 Synchronized docroot/mediawiki.org/keys/keys.html: (no justification provided) (duration: 00m 57s)
[03:52:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:55:04] Operations, Traffic: HTTPS/Browser Recommendations page on Wikitech is outdated - https://phabricator.wikimedia.org/T240813 (Emufarmers)
[03:55:06] Operations, Traffic, Documentation: Update TLS/HTTP documentation on wikitech - https://phabricator.wikimedia.org/T96844 (Emufarmers)
[04:23:11] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[04:44:28] Operations, Core Platform Team, WMF-JobQueue: Job queue seems to be processed slowly than expected - https://phabricator.wikimedia.org/T240518 (jijiki) It appears that processing backlog started increasing on 11 Dec between 21:15-21:30 UTC https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?...
[05:13:45] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[06:11:12] Operations, Core Platform Team, WMF-JobQueue: Job queue seems to be processed slowly than expected - https://phabricator.wikimedia.org/T240518 (Joe) Specifically, we have a 6 million jobs backlog on recentChangesUpdate, evergrowing since Dec 11th. I don't see any of them in the logs for JobExecutor,...
[06:11:42] Operations, Core Platform Team, WMF-JobQueue: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (Joe)
[06:24:46] Operations, Core Platform Team, WMF-JobQueue: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (Joe) Looking at `cpjobqueue` logs, it's clear it's getting 500 responses at least to some of the requests: ` stream: cpjobqueue.retry.mediawiki.job....
[06:25:39] (PS1) Marostegui: dbproxy: Depool labsdb1010 [puppet] - https://gerrit.wikimedia.org/r/557788 (https://phabricator.wikimedia.org/T238399)
[06:26:14] (PS1) Ammarpad: Add additional import sources for zhwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/557789 (https://phabricator.wikimedia.org/T240814)
[06:28:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1121 for schema change', diff saved to https://phabricator.wikimedia.org/P9871 and previous config saved to /var/cache/conftool/dbconfig/20191216-062809-marostegui.json
[06:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:52] !log Stop replication on db1121 for schema change
[06:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:13] !log Remove triggers for ar_comment on db1125:3314 T234704
[06:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:19] T234704: Remove ar_comment from sanitarium triggers - https://phabricator.wikimedia.org/T234704
[06:36:26] Operations, netops: Routinator RSYNC errors - https://phabricator.wikimedia.org/T240817 (ayounsi) Open→Resolved p:Triage→Normal
[06:36:55] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[06:37:11] I fixed the RPKI grafana alerting rules in https://phabricator.wikimedia.org/T240817
[06:37:16] they should stop flapping
[06:37:54] Operations, DBA, Wikimedia-Incident: Disallow 'weight: 0' for MW db config in dbctl - https://phabricator.wikimedia.org/T239901 (Marostegui) >>! In T239901#5739472, @Krinkle wrote: > > Is there a use case for having a replica only listed in "general" with weight 0? (As opposed to the lowest weight o...
[06:39:01] Operations, Core Platform Team, WMF-JobQueue: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (Joe) In the meantime, it seems that processing of `recentChangesUpdate` gets completely deduplicated for some reason. Let's start with the logfile that b...
[06:39:43] !log Recreate views on commonswiki,testcommonswiki for protected_titles on all labsdb hosts - T233135
[06:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:39:49] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135
[06:40:43] Operations, DC-Ops, netops: Juniper network device audit - all sites - https://phabricator.wikimedia.org/T213843 (ayounsi) Got an email from a new person in charge of the task. Sent them the updated list of what needs to be updated.
[06:41:48] (CR) Marostegui: [C: +2] dbproxy: Depool labsdb1010 [puppet] - https://gerrit.wikimedia.org/r/557788 (https://phabricator.wikimedia.org/T238399) (owner: Marostegui)
[06:42:16] !log Depool labsdb1010 - T238399
[06:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:21] T238399: Reimport wikidatawiki.{pagelinks,page} on labsdb1010 - https://phabricator.wikimedia.org/T238399
[06:56:49] !log Force re-learn cycle on db1130
[06:57:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:51] <_joe_> !log clearing apcu across multiple api servers to allow metrics to be collected again (task coming soon)
[06:58:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:48] !log pool maps2004. osm import is complete - T239728
[06:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:52] T239728: Re-import OSM data at eqiad and codfw to temporarily fix current OSM replication issues. - https://phabricator.wikimedia.org/T239728
[07:01:21] RECOVERY - MegaRAID on db1130 is OK: OK: optimal, 1 logical, 6 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:07:33] Operations, Maps, Discovery-Search (Current work): Re-import OSM data at eqiad and codfw to temporarily fix current OSM replication issues. - https://phabricator.wikimedia.org/T239728 (Mathew.onipe)
[07:09:22] !log depool maps2001 for postgres reinit - T239728
[07:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:28] T239728: Re-import OSM data at eqiad and codfw to temporarily fix current OSM replication issues. - https://phabricator.wikimedia.org/T239728
[07:09:36] Operations, serviceops: PHP Fatal error: Allowed memory size of 524288000 bytes exhausted (tried to allocate 20480 bytes) in /var/www/php-monitoring/lib.php on line 35 - https://phabricator.wikimedia.org/T240824 (Marostegui)
[07:11:37] !log Stop replication in the same position in labsdb1010 and labsdb1012 - T238399
[07:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:42] T238399: Reimport wikidatawiki.{pagelinks,page} on labsdb1010 - https://phabricator.wikimedia.org/T238399
[07:13:40] <_joe_> !log restarting cpjobqueue on scb1001 to check if processing rate of recentChanges recovers T240518
[07:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:45] T240518: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518
[07:27:19] !log Disable auto-learn on db[1126-1138].eqiad.wmnet T240823
[07:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:24] T240823: db1130 BBU possible issues - https://phabricator.wikimedia.org/T240823
[07:38:43] !log Disable auto-learn on db21[03-35] T240823
[07:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:49] T240823: db1130 BBU possible issues - https://phabricator.wikimedia.org/T240823
[07:42:54] Operations, Core Platform Team, WMF-JobQueue: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (Joe) Again, looking specifically at recentChangesUpdate: ` # Before the problem $ zgrep -F recentChangesUpdate JobExecutor.log-20191209.gz | fgrep Fini...
[07:43:25] (PS1) Marostegui: filtered_tables: Remove ar_comment reference [puppet] - https://gerrit.wikimedia.org/r/557808 (https://phabricator.wikimedia.org/T234704)
[07:52:38] !log cp1075: depool, vhtcpd not running
[07:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:47] Operations, serviceops: PHP Fatal error: Allowed memory size of 524288000 bytes exhausted (tried to allocate 20480 bytes) in /var/www/php-monitoring/lib.php on line 35 - https://phabricator.wikimedia.org/T240824 (jijiki) `[06:58] <_joe_> | !log clearing apcu across multiple api servers to allow metrics...
[08:05:40] Operations, Traffic: vhtcpd segfaulted and did not get restarted - https://phabricator.wikimedia.org/T240826 (ema)
[08:05:49] Operations, Traffic: vhtcpd segfaulted and did not get restarted - https://phabricator.wikimedia.org/T240826 (ema) p:Triage→High
[08:08:52] !log cp1075: manually start vhtcpd.service T240826
[08:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:59] T240826: vhtcpd segfaulted and did not get restarted - https://phabricator.wikimedia.org/T240826
[08:10:57] RECOVERY - Varnish HTCP daemon on cp1075 is OK: PROCS OK: 1 process with UID = 115 (vhtcpd), args vhtcpd https://wikitech.wikimedia.org/wiki/Varnish
[08:11:25] Operations, Traffic: vhtcpd segfaulted and did not get restarted - https://phabricator.wikimedia.org/T240826 (ema)
[08:12:03] !log cp1075: wipe varnish-fe and ats-be caches due to missed purges T240826
[08:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:49] Operations, Core Platform Team, WMF-JobQueue: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (elukey) From the Kafka Burrow consumer group metrics for `eqiad.mediawiki.job.recentChangesUpdate`: https://grafana.wikimedia.org/d/000000484/kafka-cons...
[08:24:53] PROBLEM - SSH on ms-be2033 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:25:37] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2005.codfw.wmnet, ms-fe2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:27:27] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2006.codfw.wmnet, ms-fe2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:29:15] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:29:15] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:30:09] RECOVERY - SSH on ms-be2033 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:33:48] <_joe_> uh what happened at swift in codfw?
[08:36:30] !log cp1075: repool all services T240826
[08:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:36] T240826: vhtcpd segfaulted and did not get restarted - https://phabricator.wikimedia.org/T240826
[08:37:43] _joe_: there seem to be various issues
[08:37:53] on ms-be2016: DISK CRITICAL - /srv/swift-storage/sdk1 is not accessible: Input/output error
[08:38:31] ms-be2033 apparently lost connectivity? I guess, given the ssh socket timeout
[08:38:44] I can't log into either ms-be2016 or the mgmt
[08:39:06] getting a "connection closed by UNKNOWN port 65535"
[08:39:35] ms-be2035 was just a monitoring blip, I triggered rechecks and it all cleared up
[08:40:38] <_joe_> so the problem on 2016 probably triggered all the problems
[08:40:54] <_joe_> including the transient one on another server if I had to guess
[08:41:55] errors-wise things look good: https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1&var-DC=codfw&var-prometheus=codfw%20prometheus%2Fops
[08:41:59] Operations, ops-codfw: Degraded RAID on ms-be2016 - https://phabricator.wikimedia.org/T240798 (MoritzMuehlenhoff) a:Papaul I can't log into the host either via SSH or the mgmt. Papaul, can you have a look the next time you're in the DC?
[08:43:00] Operations, ops-codfw: Degraded RAID on ms-be2016 - https://phabricator.wikimedia.org/T240798 (MoritzMuehlenhoff) Logins are failing with "Connection closed by UNKNOWN port 65535"
[08:43:04] <_joe_> moritzm: the server is responding to ping though
[08:43:23] <_joe_> uhm
[08:45:12] yeah the initial part of ssh connection establishment does take place, things stop at "debug1: Offering public key [...]"
[08:45:21] greetings
[08:45:33] godog: buongiorno!
[08:45:37] it might also be the usual random XFS login spike, but I can't even access the SSH on the mgmt for some reason
[08:45:45] Operations, ops-codfw: Degraded RAID on ms-be2016 - https://phabricator.wikimedia.org/T240798 (Joe) This happens after the key is offered for authentication.
[08:45:51] "XFS load spike", not "login spike"
[08:46:50] swift codfw not in the christmas spirit, looks like?
[08:48:24] ilo on ms-be2016 actually works, it's one of those which have a very outdated SSH version which doesn't negotiate current SSH kexes
[08:49:00] but no console is possible and it's just spewing "rejecting I/O to offline device" every second
[08:49:09] (CR) Ema: [C: +2] ATS: assign 8G instead of 2G to RAM caches on ats-be [puppet] - https://gerrit.wikimedia.org/r/557031 (https://phabricator.wikimedia.org/T238494) (owner: Ema)
[08:50:00] Operations, Core Platform Team, WMF-JobQueue: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (jcrespo) @DutchTina @1997kB @MarcoAurelio Could you notify the Global Renamers list of the issue, so they are aware of this situation?
[08:50:21] moritzm: powercycle in that case I'd say
[08:51:16] godog: yeah, will do it in a bit
[08:52:25] sounds good! LMK if I can help with it
[08:52:46] (CR) Marostegui: [C: +2] filtered_tables: Remove ar_comment reference [puppet] - https://gerrit.wikimedia.org/r/557808 (https://phabricator.wikimedia.org/T234704) (owner: Marostegui)
[08:52:50] (CR) Ayounsi: [C: +1] "Tested and works fine." [software/homer] - https://gerrit.wikimedia.org/r/556703 (https://phabricator.wikimedia.org/T228388) (owner: Volans)
[08:53:11] (CR) Ayounsi: [C: +2] netbox: split generic and device-specific data [software/homer] - https://gerrit.wikimedia.org/r/556703 (https://phabricator.wikimedia.org/T228388) (owner: Volans)
[08:53:49] !log powercycling ms-be2016 T240798
[08:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:55] T240798: Degraded RAID on ms-be2016 - https://phabricator.wikimedia.org/T240798
[08:54:04] !log cp1077: ats-backend-restart to increase RAM cache size T238494
[08:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:09] T238494: 15% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494
[08:55:43] (CR) jerkins-bot: [V: -1] netbox: split generic and device-specific data [software/homer] - https://gerrit.wikimedia.org/r/556703 (https://phabricator.wikimedia.org/T228388) (owner: Volans)
[08:56:04] (PS3) Ayounsi: Add virtual-chassis support [homer/public] - https://gerrit.wikimedia.org/r/550370
[08:56:27] PROBLEM - Host ms-be2016 is DOWN: PING CRITICAL - Packet loss = 100%
[08:57:29] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/557103 (https://phabricator.wikimedia.org/T224585) (owner: Phamhi)
[08:57:49] RECOVERY - Host ms-be2016 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms
[08:58:13] RECOVERY - MD RAID on ms-be2016 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[08:58:23] RECOVERY - Disk space on ms-be2016 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2016&var-datasource=codfw+prometheus/ops
[08:59:05] (CR) Filippo Giunchedi: [C: +1] "More context: this only applies to WMCS because the graphite replication is rsync+cron. In production replication is achieved via mirrorin" [puppet] - https://gerrit.wikimedia.org/r/557103 (https://phabricator.wikimedia.org/T224585) (owner: Phamhi)
[08:59:07] RECOVERY - Check systemd state on ms-be2016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:00:11] PROBLEM - rsyslog TLS listener on port 6514 on centrallog1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs
[09:00:13] Operations, netops: Network issues reaching phabricator on IPv6 (Comcast/Portland OR) - https://phabricator.wikimedia.org/T240488 (jcrespo) @ayounsi For context, I suggested him on IRC to file a ticket if problems were continuous to have a look (not necessarily to fix them, if there is nothing we can do...
[09:00:55] (CR) Filippo Giunchedi: [C: +2] Install pygerrit2 on releases server [puppet] - https://gerrit.wikimedia.org/r/557075 (https://phabricator.wikimedia.org/T196517) (owner: 20after4)
[09:01:11] RECOVERY - rsyslog TLS listener on port 6514 on centrallog1001 is OK: SSL OK - Certificate centrallog1001.eqiad.wmnet valid until 2024-06-25 15:42:33 +0000 (expires in 1653 days) https://wikitech.wikimedia.org/wiki/Logs
[09:02:59] RECOVERY - puppet last run on ms-be2016 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:03:20] Operations, Core Platform Team, WMF-JobQueue: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (Sotiale) >>! In T240518#5743510, @jcrespo wrote: > @DutchTina @1997kB @MarcoAurelio Could you notify the Global Renamers list of the issue, so they ar...
[09:04:55] !log mwscript importImages.php --wiki=commonswiki --comment-ext=txt --user=Coffeeandcrumbs /home/urbanecm/T240825 (T240825)
[09:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:01] T240825: Server-side upload for Coffeeandcrumbs - https://phabricator.wikimedia.org/T240825
[09:09:19] (PS8) KartikMistry: Enable CX out of beta for newly created WPs [mediawiki-config] - https://gerrit.wikimedia.org/r/548730 (https://phabricator.wikimedia.org/T234318)
[09:10:03] (CR) jerkins-bot: [V: -1] Enable CX out of beta for newly created WPs [mediawiki-config] - https://gerrit.wikimedia.org/r/548730 (https://phabricator.wikimedia.org/T234318) (owner: KartikMistry)
[09:11:16] Operations, ops-codfw: Degraded RAID on ms-be2016 - https://phabricator.wikimedia.org/T240798 (MoritzMuehlenhoff) After a power cycle the host is up just fine again, there's nothing in kern/syslog by the time of the crash. Filippo is checking whether there's a firmware update available for the RAID contr...
[09:12:15] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/557671 (https://phabricator.wikimedia.org/T240438) (owner: RetroCraft)
[09:13:59] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime
[09:14:00] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:19] !log upgrade hw raid firmware on ms-be2016 and reboot - T240798
[09:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:24] T240798: Degraded RAID on ms-be2016 - https://phabricator.wikimedia.org/T240798
[09:24:03] !log Reloading Jenkins CI
[09:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:36] (PS3) Volans: netbox: split generic and device-specific data [software/homer] - https://gerrit.wikimedia.org/r/556703 (https://phabricator.wikimedia.org/T228388)
[09:25:38] (PS8) Volans: Add virtual-chassis support [software/homer] - https://gerrit.wikimedia.org/r/550367 (owner: Ayounsi)
[09:25:40] (PS3) Volans: netbox: add vlan support [software/homer] - https://gerrit.wikimedia.org/r/550375 (owner: Ayounsi)
[09:25:58] Operations, ops-codfw: Degraded RAID on ms-be2016 - https://phabricator.wikimedia.org/T240798 (fgiunchedi) Open→Resolved hw raid firmware upgraded, resolving
[09:30:40] (CR) Ayounsi: [C: +2] netbox: split generic and device-specific data [software/homer] - https://gerrit.wikimedia.org/r/556703 (https://phabricator.wikimedia.org/T228388) (owner: Volans)
[09:36:03] (CR) Muehlenhoff: Add image tracking support (2 comments) [software/debmonitor] - https://gerrit.wikimedia.org/r/552517 (https://phabricator.wikimedia.org/T237978) (owner: Muehlenhoff)
[09:49:29] (PS1) Jon Harald Søby: Add no=>nb in $wgInterlanguageLinkCodeMap [mediawiki-config] - https://gerrit.wikimedia.org/r/557838 (https://phabricator.wikimedia.org/T174160)
[09:50:46] !log Stop replication in the same position in labsdb1010 and labsdb1012 - T238399
[09:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:52] T238399: Reimport wikidatawiki.{pagelinks,page} on labsdb1010 - https://phabricator.wikimedia.org/T238399
[09:51:30] Nikerabbit, kart_: Want me to help with 548730?
[09:51:46] (CR) Nikerabbit: [C: +1] Add no=>nb in $wgInterlanguageLinkCodeMap [mediawiki-config] - https://gerrit.wikimedia.org/r/557838 (https://phabricator.wikimedia.org/T174160) (owner: Jon Harald Søby)
[09:52:47] (PS10) Volans: Add image tracking support [software/debmonitor] - https://gerrit.wikimedia.org/r/552517 (https://phabricator.wikimedia.org/T237978) (owner: Muehlenhoff)
[09:52:52] James_F: yeah what would be the best way to keep ~200 wikipedias having cx in beta while newly created wikis should not get that setting?
[09:53:14] Just put them in the array? Creating a new dblist is quite expensive for such a short-term use.
[09:54:08] the cx beta list will be getting shorter, but over which period of time is not yet clear
[09:54:14] wgContentTranslationAsBetaFeature means "Hide CT behind a beta feature flag rather than enable by default", right? Except it's not loaded on non-Wikipedias, per wmgUseContentTranslation
[09:54:38] (CR) Volans: "Thanks for the review, replies inline, fixed in PS10" (2 comments) [software/debmonitor] - https://gerrit.wikimedia.org/r/552517 (https://phabricator.wikimedia.org/T237978) (owner: Muehlenhoff)
[09:54:50] If you wanted to be clever, you could re-use wmgUseContentTranslation with 'beta' as the value (which is truthy), with a secondary check inside the CommonSettings if.
[09:54:55] * James_F shrugs.
[09:55:12] Want me to fiddle with the patch?
[09:57:31] if kart_ (ping) is okay with that :)
[09:58:36] yes, there are three possible states: no cx, cx in beta and cx out of beta
[09:58:44] currently implemented using two separate variables
[10:00:45] Operations, serviceops, Kubernetes: Collect metrics from envoy where it is enabled on k8s - https://phabricator.wikimedia.org/T237234 (Joe) Open→Resolved p:Triage→Normal
[10:00:49] Long-term it'll be just two states, right?
[10:01:03] Makes sense to have two variables rather than add complexity permanently.
[10:01:05] Operations, serviceops, Kubernetes, Patch-For-Review: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (Joe)
[10:01:39] (CR) Ayounsi: [C: +2] "Tested and works as expected." [software/homer] - https://gerrit.wikimedia.org/r/550367 (owner: Ayounsi)
[10:01:57] James_F: yes eventually cx-in-beta is going to be dropped
[10:02:04] (CR) Ayounsi: [C: +2] "recheck" [software/homer] - https://gerrit.wikimedia.org/r/556703 (https://phabricator.wikimedia.org/T228388) (owner: Volans)
[10:04:22] (CR) Arturo Borrero Gonzalez: wmcs: monitoring: remove rssh (1 comment) [puppet] - https://gerrit.wikimedia.org/r/557103 (https://phabricator.wikimedia.org/T224585) (owner: Phamhi)
[10:07:04] (PS9) Jforrester: ContentTranslation: Make available by default for new Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/548730 (https://phabricator.wikimedia.org/T234318) (owner: KartikMistry)
[10:07:07] Nikerabbit, kart_: There, is that helpful?
[10:08:41] James_F: lgtm
[10:09:26] (CR) jerkins-bot: [V: -1] ContentTranslation: Make available by default for new Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/548730 (https://phabricator.wikimedia.org/T234318) (owner: KartikMistry)
[10:10:12] Oh, meh, mowiki isn't real?
[10:10:31] it redirects to ro.wikipedia.org
[10:10:53] I love it that it is caught by tests
[10:11:27] (PS4) Giuseppe Lavagetto: Create common template helpers directory [deployment-charts] - https://gerrit.wikimedia.org/r/554833 (https://phabricator.wikimedia.org/T235411)
[10:11:29] (PS2) Giuseppe Lavagetto: blubberoid: release new chart version using the common templates directory [deployment-charts] - https://gerrit.wikimedia.org/r/556137 (https://phabricator.wikimedia.org/T235411)
[10:11:31] (PS1) Giuseppe Lavagetto: cxserver: add TLS termination [deployment-charts] - https://gerrit.wikimedia.org/r/557846 (https://phabricator.wikimedia.org/T235411)
[10:11:35] (PS1) Giuseppe Lavagetto: cxserver: enable TLS in production [deployment-charts] - https://gerrit.wikimedia.org/r/557847 (https://phabricator.wikimedia.org/T235411)
[10:13:26] Nikerabbit: Ah, but we still have a YAML file for it; will fix that. Thanks! And yes, I'm happy my tests work.
[10:13:27] Operations, LDAP-Access-Requests: Requesting access to Turnilo - https://phabricator.wikimedia.org/T240680 (jcrespo) I've seen your MW account [[ https://www.mediawiki.org/w/index.php?title=User:Easikingarmager | was renamed ]] at some point - that could explain the double LDAP account. Just to clarify, th...
[10:13:51] Operations, Traffic: vhtcpd segfaulted and did not get restarted - https://phabricator.wikimedia.org/T240826 (MoritzMuehlenhoff) The vhtcpd.service unit is auto-generated from the vhtcpd sysvinit script by means of the systemd-sysv-generator, which simply printfs "Restart=No" to the systemd units it gene...
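The T240826 comment just above attributes the missing restart to the unit that systemd-sysv-generator derives from the sysvinit script. As an illustration only, and not the change made in the task, a drop-in override is one way such a restart policy could be expressed. The drop-in path and the `RestartSec` value below are assumptions; whether systemd actually notices the daemon dying depends on how the generated unit tracks the main PID, so a native unit file may be the more reliable route.

```ini
# /etc/systemd/system/vhtcpd.service.d/restart.conf  (hypothetical drop-in path)
[Service]
# Restart the daemon after an unclean exit (non-zero status, fatal signal such as SIGSEGV, timeout).
Restart=on-failure
# Pause between restart attempts to avoid a tight crash loop (value is an assumption).
RestartSec=5
```

After adding a drop-in, `systemctl daemon-reload` makes systemd re-read the unit, and `systemctl show vhtcpd.service -p Restart` reports the effective setting.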
[10:14:20] !log Restarting CI Jenkins due to out of sync state between Zuul Gearman and what is actually running (some jobs got lost) [10:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:20] (03PS10) 10Jforrester: ContentTranslation: Make available by default for new Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548730 (https://phabricator.wikimedia.org/T234318) (owner: 10KartikMistry) [10:16:22] (03PS1) 10Jforrester: Drop mowiki.yaml, renamed wiki, never used [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557848 [10:16:50] I broke Zuul/CI/Jenkins due to reloading Jenkins configuration from disk :/ [10:18:10] Fun. [10:18:49] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (10jbond) [10:20:03] (03Merged) 10jenkins-bot: netbox: split generic and device-specific data [software/homer] - 10https://gerrit.wikimedia.org/r/556703 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [10:25:58] (03CR) 10Ayounsi: "> Patch Set 2:" (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/550370 (owner: 10Ayounsi) [10:32:05] (03CR) 10Filippo Giunchedi: [C: 03+2] install_server: standard recipe and raid1/raid10 [puppet] - 10https://gerrit.wikimedia.org/r/553363 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [10:33:39] (03CR) 10Filippo Giunchedi: [C: 03+2] install_server: apply standard partman recipes, take #1 [puppet] - 10https://gerrit.wikimedia.org/r/553483 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [10:33:49] (03PS12) 10Filippo Giunchedi: install_server: apply standard partman recipes, take #1 [puppet] - 10https://gerrit.wikimedia.org/r/553483 (https://phabricator.wikimedia.org/T156955) [10:37:15] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:41:40] !log delete virtual chassis ID on asw-d-codfw [10:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:53] 10Operations: Track services without a native systemd unit - https://phabricator.wikimedia.org/T240843 (10MoritzMuehlenhoff) [10:57:30] (03CR) 10Alexandros Kosiaris: [C: 04-1] "LGTM, overall, but I 'd rather we did not rely on ruby for grepping a string." (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/554833 (https://phabricator.wikimedia.org/T235411) (owner: 10Giuseppe Lavagetto) [10:57:35] !log disable puppet on labstore100[6,7] and stop analytics-related systemd timers - prep step for Kerberos [10:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:43] Cc: apergos, bstorm_ --^ [10:58:56] (03CR) 10Jforrester: [C: 03+1] ContentTranslation: Make available by default for new Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548730 (https://phabricator.wikimedia.org/T234318) (owner: 10KartikMistry) [11:04:07] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:11:58] (03PS1) 10Ema: ATS: allow to configure server_session_sharing.pool [puppet] - 10https://gerrit.wikimedia.org/r/557883 (https://phabricator.wikimedia.org/T238494) [11:16:12] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM. Non-blocking comment inlined." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/557086 (https://phabricator.wikimedia.org/T239918) (owner: 10Jhedden) [11:18:40] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM. Minor comment inline." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/557144 (https://phabricator.wikimedia.org/T240009) (owner: 10Bstorm) [11:20:09] (03CR) 10Ema: "sane-looking pcc output here: https://puppet-compiler.wmflabs.org/compiler1003/19985/" [puppet] - 10https://gerrit.wikimedia.org/r/557883 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [11:25:05] (03CR) 10Arturo Borrero Gonzalez: cloudvps: rename+reimage labmon1001 as cloudmetrics1001 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/555565 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [11:25:40] (03CR) 10Ema: [C: 03+2] ATS: add SystemTap probe to trace session teardown [puppet] - 10https://gerrit.wikimedia.org/r/557066 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [11:27:00] (03CR) 10Arturo Borrero Gonzalez: cloudvps: rename+reimage labmon1001 as cloudmetrics1001 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/555570 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [11:27:11] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557789 (https://phabricator.wikimedia.org/T240814) (owner: 10Ammarpad) [11:27:35] 10Operations, 10DC-Ops, 10netops: Juniper network device audit - all sites - https://phabricator.wikimedia.org/T213843 (10faidon) Thanks @ayounsi! Appreciate the follow up. What exactly did you ask them to do in this last communication? This has been going on for too long, so I think we need a change of str... [11:29:21] 10Operations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): WMCS: cleanup network allocations - https://phabricator.wikimedia.org/T240670 (10aborrero) Needs discussion in the next WMCS team meeting: I would like to double check with you all that doing this cleanup is right, and make sure yo... [11:30:04] jan_drewniak: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikimedia Portals Update . 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191216T1130). [11:32:14] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558017 (https://phabricator.wikimedia.org/T128546) [11:34:36] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558017 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:34:39] (03CR) 10Vgutierrez: [C: 03+1] ATS: allow to configure server_session_sharing.pool [puppet] - 10https://gerrit.wikimedia.org/r/557883 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [11:35:33] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558017 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:40:21] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:558017| Bumping portals to master (T128546)]] (duration: 00m 56s) [11:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:27] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [11:41:14] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:558017| Bumping portals to master (T128546)]] (duration: 00m 52s) [11:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:31] jan_drewniak: All clear? [11:47:38] (03PS2) 10Jforrester: Drop mowiki.yaml, renamed wiki, never used [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557848 [11:47:43] James_F: yup! [11:47:49] (03CR) 10Jforrester: [C: 03+2] Drop mowiki.yaml, renamed wiki, never used [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557848 (owner: 10Jforrester) [11:47:52] Cool. 
[11:48:38] (03Merged) 10jenkins-bot: Drop mowiki.yaml, renamed wiki, never used [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557848 (owner: 10Jforrester) [11:50:02] (Prod clear again.) [11:54:39] (03CR) 10Alexandros Kosiaris: toolforge-k8s: Add a service and notify around kubelet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/557071 (owner: 10Bstorm) [11:55:28] !log Restarting Jenkins completely to flush out stale Gearman functions in Zuul [11:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:39] XioNoX: I am restarting CI :\ [11:57:33] done [11:58:42] !log jynus@cumin1001 dbctl commit (dc=all): 'Depool db1130', diff saved to https://phabricator.wikimedia.org/P9873 and previous config saved to /var/cache/conftool/dbconfig/20191216-115841-jynus.json [11:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:49] (03Merged) 10jenkins-bot: Add virtual-chassis support [software/homer] - 10https://gerrit.wikimedia.org/r/550367 (owner: 10Ayounsi) [11:58:53] ^marostegui [11:59:36] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] toolforge-k8s: Add a service and notify around kubelet [puppet] - 10https://gerrit.wikimedia.org/r/557071 (owner: 10Bstorm) [11:59:52] thanks [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191216T1200). [12:00:04] Jhs: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:08] * Jhs is here [12:00:15] * Urbanecm too [12:00:25] Jhs: you able to test that one? 
[12:00:31] Urbanecm, yup [12:00:38] (03CR) 10Urbanecm: [C: 03+2] Add no=>nb in $wgInterlanguageLinkCodeMap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557838 (https://phabricator.wikimedia.org/T174160) (owner: 10Jon Harald Søby) [12:00:42] let's go for it then [12:00:50] Urbanecm, but i want to test thoroughly since this is the first time we're enabling that config option [12:00:59] sure, I'll give you enough time [12:01:05] 👍 [12:01:32] (03Merged) 10jenkins-bot: Add no=>nb in $wgInterlanguageLinkCodeMap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557838 (https://phabricator.wikimedia.org/T174160) (owner: 10Jon Harald Søby) [12:02:16] Jhs: should be live at mwdebug1001 [12:05:43] 10Operations, 10Wikimedia-Mailing-lists: Close mailing list "Wikivoyage-zh" - https://phabricator.wikimedia.org/T240850 (10Gabrielchihonglee) [12:09:29] Urbanecm, everything looks 100 % right. yay :D [12:09:53] great, thanks Jhs ! [12:10:35] Jhs: this leaves only wikidata to be fixed? [12:10:53] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.8792 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [12:10:54] Nikerabbit, not sure what to fix about Wikidata? 
[12:11:05] Jhs: the task mentions something about wikidata [12:11:13] ah, let me check [12:11:24] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 026913d: Add no=>nb in $wgInterlanguageLinkCodeMap (T174160) (duration: 00m 53s) [12:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:30] T174160: Language code(s) for nowiki should be changed to nb - https://phabricator.wikimedia.org/T174160 [12:11:35] Jhs: synced :-) [12:12:16] Jhs: https://www.wikidata.org/wiki/Q10270#sitelinks-wikipedia language name is displayed in the tooltip, and it says "norja" for me, maybe that's what the reporter meant [12:12:17] Nikerabbit, ah right. i think that refers to renaming the actual database (nowiki»nbwiki) [12:12:30] !log EU SWAT done [12:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:17] PROBLEM - MegaRAID on db1130 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:14:29] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.0875 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [12:15:34] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [12:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:04] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:31] the :14 alert is T240823, not acknowledging it yet, but host was depooled as a precaution [12:16:31] T240823: db1130 BBU possible issues - https://phabricator.wikimedia.org/T240823 [12:17:19] Thanks a lot James_F and Nikerabbit [12:17:50] kart_: Happy to help. 
Sorry the system is so complicated. [12:18:06] No worries. Less chance of errors :) [12:18:07] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.6333 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [12:18:29] 10Operations, 10Core Platform Team, 10WMF-JobQueue: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (10Samat) I just wanted to report, that 1. yesterday morning I wanted to send out 22 messages with the MassMessages tool, and after I sent them, the "Queu... [12:21:45] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.0625 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [12:25:58] 10Operations, 10Core Platform Team, 10WMF-JobQueue: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (10Ymblanter) Are failing uploads on Commons the same issue? [12:38:58] 10Operations, 10Core Platform Team, 10WMF-JobQueue: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (10jcrespo) Ymblanter: You may want to check T240698 instead. [12:39:39] 10Operations, 10Core Platform Team, 10WMF-JobQueue: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (10IKhitron) There is also a problem with pageviews data. For example, there is a two days delay in hewiki. There was a 1.5 days delay in ruwiki, and a one... 
[12:47:52] 10Operations, 10Core Platform Team, 10WMF-JobQueue: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (10jcrespo) https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?orgId=1&from=1575805469987&to=1576500217867&fullscreen&panelId=5&var-site=eqiad&var-... [12:49:44] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Query-Service, 10Patch-For-Review: LDF service does not Vary responses by Accept, sending incorrect cached responses to clients - https://phabricator.wikimedia.org/T232006 (10Gehel) Note that we already set a `Vary: Accept` HTTP header at nginx level (https... [12:52:24] !log shutdown of the Analytics Hadoop cluster to enable Kerberos [12:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:38] please ping the analytics team for any issue --^ [12:56:09] (03PS6) 10Elukey: Enable Kerberos in Hadoop Analytics and Druid Analytics/Public [puppet] - 10https://gerrit.wikimedia.org/r/549566 (https://phabricator.wikimedia.org/T237269) [12:56:20] 10Operations, 10Maps, 10Discovery-Search (Current work): Re-import OSM data at eqiad and codfw to temporarily fix current OSM replication issues. - https://phabricator.wikimedia.org/T239728 (10Mathew.onipe) [13:07:04] (03CR) 10John Erling Blad: Add no=>nb in $wgInterlanguageLinkCodeMap (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557838 (https://phabricator.wikimedia.org/T174160) (owner: 10Jon Harald Søby) [13:09:27] (03CR) 10John Erling Blad: "Note that this patchset should have a -2 until Wiktionary and Wikiquote are consulted. 
It is likely they want to go for bokmål as there ar" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557838 (https://phabricator.wikimedia.org/T174160) (owner: 10Jon Harald Søby) [13:16:37] (03CR) 10Elukey: [C: 03+2] Enable Kerberos in Hadoop Analytics and Druid Analytics/Public [puppet] - 10https://gerrit.wikimedia.org/r/549566 (https://phabricator.wikimedia.org/T237269) (owner: 10Elukey) [13:20:25] 10Operations, 10ops-eqiad, 10Patch-For-Review: setup/install censorship1001.eqiad.wmnet - https://phabricator.wikimedia.org/T239250 (10ssingh) Hi. After some more discussion internally, we have decided that we should set the hostname to `cescout1001` instead of `censorship1001` which is what we agreed to ear... [13:25:52] (03PS11) 10Muehlenhoff: Add image tracking support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/552517 (https://phabricator.wikimedia.org/T237978) [13:28:27] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:58] (03PS1) 10Jcrespo: ldap: Add EAsikingarmager to the list of ldap-only users [puppet] - 10https://gerrit.wikimedia.org/r/558041 (https://phabricator.wikimedia.org/T240680) [13:32:03] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:33:47] !log restarting systemd-timesyncd on stat1005 [13:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:56] !log cp-ats: rolling ats-backend-restart to apply ram cache size changes T238494 [13:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:02] T238494: 15% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 [13:34:40] (03CR) 10Jcrespo: [C: 03+2] ldap: Add EAsikingarmager to the list of ldap-only users [puppet] - 10https://gerrit.wikimedia.org/r/558041 (https://phabricator.wikimedia.org/T240680) (owner: 10Jcrespo) [13:37:29] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Requesting access to Turnilo - https://phabricator.wikimedia.org/T240680 (10jcrespo) @Easikingarmager Access has been granted, please check you can login to the desired service, and close as resolved or ping me so I can check. 
[13:39:51] (03CR) 10Phamhi: [C: 03+1] toolforge-k8s: Add a service and notify around kubelet [puppet] - 10https://gerrit.wikimedia.org/r/557071 (owner: 10Bstorm) [13:40:23] 10Operations, 10Core Platform Team, 10WMF-JobQueue: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (10Krenair) Pageview problems are T240803 [13:45:17] (03CR) 10IAmNetx: [C: 04-1] Added bnwikibooks,bnwikisource,ukwikivoyage under wiki hd logos (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557053 (owner: 10TechneSiyam) [13:50:25] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [13:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:55] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:21] (03PS4) 10CDanis: dbctl: also read externalLoads from dbctl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556236 (https://phabricator.wikimedia.org/T229686) [13:54:12] 10Operations, 10DC-Ops, 10netops: Juniper network device audit - all sites - https://phabricator.wikimedia.org/T213843 (10ayounsi) Removing the serials: ` Hi Jim, No problem to re-state everything as clear as I can if it's the last one :) Serials that should have their install base set to Equinix Ashburn (... [13:55:51] (03PS1) 10Filippo Giunchedi: prometheus: bump Logstash Elasticsearch indexing failures thresholds [puppet] - 10https://gerrit.wikimedia.org/r/558043 (https://phabricator.wikimedia.org/T240667) [14:00:04] cdanis: Your horoscope predicts another unfortunate enable dbctl externalLoads in MediaWiki deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191216T1400). 
[14:00:17] 10Operations, 10SRE-Access-Requests: Requesting access to stat1004, stat1007, stat1006, notebook1003, notebook1004 for Kate Zimmerman - https://phabricator.wikimedia.org/T240732 (10jcrespo) a:03Nuria As per procedure, this will require also approval of project lead of access requested, in this case @Nuria. [14:00:25] jouncebot: 😘 [14:00:41] (03CR) 10CDanis: [C: 03+2] dbctl: also read externalLoads from dbctl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556236 (https://phabricator.wikimedia.org/T229686) (owner: 10CDanis) [14:01:54] (03Merged) 10jenkins-bot: dbctl: also read externalLoads from dbctl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556236 (https://phabricator.wikimedia.org/T229686) (owner: 10CDanis) [14:02:15] RECOVERY - MegaRAID on db1130 is OK: OK: optimal, 1 logical, 6 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:03:28] !log cdanis@deploy1001 Synchronized wmf-config/etcd.php: enable dbctl for externalLoads 6dfb30c76 T229686 (duration: 00m 53s) [14:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:34] T229686: #dbctl: manage 'externalLoads' data - https://phabricator.wikimedia.org/T229686 [14:03:50] ok marostegui it is live [14:04:03] cdanis: so far so good on my end [14:04:51] cdanis: I am going to give it some more minutes and then I will try to depool a host [14:04:57] marostegui: +1 [14:05:40] 10Operations, 10SRE-Access-Requests: Requesting access to stat1004, stat1007, stat1006, notebook1003, notebook1004 for Kate Zimmerman - https://phabricator.wikimedia.org/T240732 (10jcrespo) I was going to comment the need to sign L3, but I see it was already done: Sat, Dec 14, 00:20 [14:07:58] (03PS1) 10Elukey: Enable kerberos for the Hadoop mapreduce historyserver [puppet] - 10https://gerrit.wikimedia.org/r/558047 [14:08:11] (03PS1) 10Ema: ATS: separate wikidata sessions from others [puppet] - 10https://gerrit.wikimedia.org/r/558048 
(https://phabricator.wikimedia.org/T238494) [14:08:33] (03CR) 10Elukey: [C: 03+2] Enable kerberos for the Hadoop mapreduce historyserver [puppet] - 10https://gerrit.wikimedia.org/r/558047 (owner: 10Elukey) [14:08:38] 10Operations, 10SRE-Access-Requests: Requesting access to stat1004, stat1007, stat1006, notebook1003, notebook1004 for Kate Zimmerman - https://phabricator.wikimedia.org/T240732 (10jcrespo) p:05Triage→03High [14:08:50] (03CR) 10jerkins-bot: [V: 04-1] ATS: separate wikidata sessions from others [puppet] - 10https://gerrit.wikimedia.org/r/558048 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [14:09:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1127 from x1 for testing', diff saved to https://phabricator.wikimedia.org/P9874 and previous config saved to /var/cache/conftool/dbconfig/20191216-140951-marostegui.json [14:09:54] cdanis: ^ [14:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:58] checking if the traffic stops [14:10:01] \o/ [14:10:11] yep, looks like it does [14:10:12] \o/ [14:11:05] cdanis: I will change the weights later to make them 100 too [14:11:10] sweet, sounds good [14:11:28] looks like not all the traffic stopped? 
or maybe the graphs are just lagging [14:11:38] yeah, traffic stopped [14:11:39] no more queries [14:11:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1127 after testing', diff saved to https://phabricator.wikimedia.org/P9875 and previous config saved to /var/cache/conftool/dbconfig/20191216-141141-marostegui.json [14:11:43] oh, no, there it goes [14:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:49] and now the traffic is back [14:11:50] awesome [14:13:16] (03CR) 10Giuseppe Lavagetto: Create common template helpers directory (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/554833 (https://phabricator.wikimedia.org/T235411) (owner: 10Giuseppe Lavagetto) [14:13:24] (03PS5) 10Giuseppe Lavagetto: Create common template helpers directory [deployment-charts] - 10https://gerrit.wikimedia.org/r/554833 (https://phabricator.wikimedia.org/T235411) [14:13:26] op success [14:14:19] (03PS1) 10Fomafix: Add DEPRECATED_LANGUAGE_CODE_MAPPING to wgInterlanguageLinkCodeMap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558052 [14:14:36] 10Operations, 10LDAP-Access-Requests: Requesting access to Turnilo - https://phabricator.wikimedia.org/T240680 (10Easikingarmager) Thank you @jcrespo , just successfully logged in. I will close this ticket as resolved now. 
[14:14:45] (03PS2) 10Ema: ATS: separate wikidata sessions from others [puppet] - 10https://gerrit.wikimedia.org/r/558048 (https://phabricator.wikimedia.org/T238494) [14:15:14] 10Operations, 10LDAP-Access-Requests: Requesting access to Turnilo - https://phabricator.wikimedia.org/T240680 (10Easikingarmager) 05Open→03Resolved [14:15:27] (03CR) 10jerkins-bot: [V: 04-1] ATS: separate wikidata sessions from others [puppet] - 10https://gerrit.wikimedia.org/r/558048 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [14:17:44] (03CR) 10Alexandros Kosiaris: [C: 03+1] Create common template helpers directory [deployment-charts] - 10https://gerrit.wikimedia.org/r/554833 (https://phabricator.wikimedia.org/T235411) (owner: 10Giuseppe Lavagetto) [14:19:44] (03PS1) 10Jon Harald Søby: Remove Wiktionary and Wikiquote from $wgInterlanguageLinkCodeMap for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558053 (https://phabricator.wikimedia.org/T174160) [14:19:55] (03PS1) 10CDanis: db-codfw: remove dbctl-obsoleted externalLoads section [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558054 (https://phabricator.wikimedia.org/T229686) [14:20:02] (03CR) 10jerkins-bot: [V: 04-1] Remove Wiktionary and Wikiquote from $wgInterlanguageLinkCodeMap for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558053 (https://phabricator.wikimedia.org/T174160) (owner: 10Jon Harald Søby) [14:20:52] (03PS3) 10Ema: ATS: separate wikidata sessions from others [puppet] - 10https://gerrit.wikimedia.org/r/558048 (https://phabricator.wikimedia.org/T238494) [14:21:11] (03PS2) 10CDanis: db-codfw: remove dbctl-obsoleted externalLoads section [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558054 (https://phabricator.wikimedia.org/T229686) [14:21:13] (03CR) 10jerkins-bot: [V: 04-1] db-codfw: remove dbctl-obsoleted externalLoads section [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558054 (https://phabricator.wikimedia.org/T229686) (owner: 10CDanis) [14:22:02] (03PS3) 
10CDanis: db-codfw: remove dbctl-obsoleted externalLoads section [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558054 (https://phabricator.wikimedia.org/T229686) [14:25:30] (03CR) 10Marostegui: db-codfw: remove dbctl-obsoleted externalLoads section (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558054 (https://phabricator.wikimedia.org/T229686) (owner: 10CDanis) [14:26:19] (03CR) 10CDanis: db-codfw: remove dbctl-obsoleted externalLoads section (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558054 (https://phabricator.wikimedia.org/T229686) (owner: 10CDanis) [14:26:33] (03PS4) 10CDanis: db-codfw: remove dbctl-obsoleted externalLoads section [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558054 (https://phabricator.wikimedia.org/T229686) [14:26:39] (03PS3) 10Alexandros Kosiaris: profile::etcd::v3: parametrize adv_client_port [puppet] - 10https://gerrit.wikimedia.org/r/553508 (owner: 10Giuseppe Lavagetto) [14:27:45] (03PS4) 10Alexandros Kosiaris: profile::etcd::v3: parametrize adv_client_port [puppet] - 10https://gerrit.wikimedia.org/r/553508 (owner: 10Giuseppe Lavagetto) [14:27:52] (03CR) 10Marostegui: [C: 03+1] db-codfw: remove dbctl-obsoleted externalLoads section [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558054 (https://phabricator.wikimedia.org/T229686) (owner: 10CDanis) [14:28:32] (03CR) 10CDanis: [C: 03+2] db-codfw: remove dbctl-obsoleted externalLoads section [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558054 (https://phabricator.wikimedia.org/T229686) (owner: 10CDanis) [14:28:54] cdanis: let me know when it is on mwdebug so I can also test codfw [14:29:19] (03Merged) 10jenkins-bot: db-codfw: remove dbctl-obsoleted externalLoads section [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558054 (https://phabricator.wikimedia.org/T229686) (owner: 10CDanis) [14:30:07] marostegui: mwdebug2001 has it [14:30:14] ok, going there! 
[14:30:21] !log manual testing of I219711eb on mwdebug2001 [14:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:31] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Create common template helpers directory [deployment-charts] - 10https://gerrit.wikimedia.org/r/554833 (https://phabricator.wikimedia.org/T235411) (owner: 10Giuseppe Lavagetto) [14:30:49] (03CR) 10Giuseppe Lavagetto: [C: 03+2] blubberoid: release new chart version using the common templates directory [deployment-charts] - 10https://gerrit.wikimedia.org/r/556137 (https://phabricator.wikimedia.org/T235411) (owner: 10Giuseppe Lavagetto) [14:30:53] (03PS3) 10Giuseppe Lavagetto: blubberoid: release new chart version using the common templates directory [deployment-charts] - 10https://gerrit.wikimedia.org/r/556137 (https://phabricator.wikimedia.org/T235411) [14:30:55] (03CR) 10jerkins-bot: [V: 04-1] blubberoid: release new chart version using the common templates directory [deployment-charts] - 10https://gerrit.wikimedia.org/r/556137 (https://phabricator.wikimedia.org/T235411) (owner: 10Giuseppe Lavagetto) [14:31:09] <_joe_> interesting [14:31:41] <_joe_> oh gate-and-submit with ff-only [14:32:16] !log delete virtual chassis ID on asw-c-codfw [14:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:29] marostegui: https://logstash.wikimedia.org/goto/beae9b80a35ad70bfacca286202eb398 :) [14:33:57] looks good to me, if you agree I am going to deploy this and then do the same in db-eqiad [14:34:03] !log delete virtual chassis ID on asw-b-codfw [14:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:15] cdanis: I was checking this: https://logstash.wikimedia.org/goto/f3bec92c913a8e5534ce72e7f89e2f85 [14:34:52] but it doesn't look related [14:35:04] which one, the udpsocket thing? 
[14:35:12] that is unrelated and a known (but not understood) issue on all mwdebug hosts [14:35:17] (03PS4) 10Ema: ATS: separate wikidata sessions from others [puppet] - 10https://gerrit.wikimedia.org/r/558048 (https://phabricator.wikimedia.org/T238494) [14:35:48] (03CR) 10Alexandros Kosiaris: [C: 03+2] profile::etcd::v3: parametrize adv_client_port [puppet] - 10https://gerrit.wikimedia.org/r/553508 (owner: 10Giuseppe Lavagetto) [14:35:52] !log delete virtual chassis ID on asw-a-codfw [14:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:14] cdanis: yeah, I was checking all the errors, but I think they are unrelated [14:36:27] !log oblivian@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'blubberoid' for release 'staging' . [14:36:28] Let me check one more thing [14:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:32] ok [14:37:19] cdanis: good looks, let's go for eqiad? :) [14:37:29] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] "IDs manually removed, everything good." [homer/public] - 10https://gerrit.wikimedia.org/r/550370 (owner: 10Ayounsi) [14:38:20] !log oblivian@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'blubberoid' for release 'production' . 
[14:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:30] (03PS1) 10CDanis: db-eqiad: remove dbctl-obsoleted externalLoads section [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558062 (https://phabricator.wikimedia.org/T229686) [14:39:44] !log cdanis@deploy1001 Synchronized wmf-config/etcd.php: db-codfw: remove dbctl-obsoleted externalLoads section 519e37461 T229686 (duration: 00m 53s) [14:39:48] (03PS1) 10Alexandros Kosiaris: profile::etcd::v3: Fix reassignment of $adv_client_port [puppet] - 10https://gerrit.wikimedia.org/r/558063 [14:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:50] T229686: #dbctl: manage 'externalLoads' data - https://phabricator.wikimedia.org/T229686 [14:39:51] !log oblivian@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'blubberoid' for release 'production' . [14:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:28] (03CR) 10Alexandros Kosiaris: [C: 03+2] profile::etcd::v3: Fix reassignment of $adv_client_port [puppet] - 10https://gerrit.wikimedia.org/r/558063 (owner: 10Alexandros Kosiaris) [14:42:05] marostegui: https://gerrit.wikimedia.org/r/558062 [14:42:09] checking [14:42:39] (03CR) 10Marostegui: [C: 03+1] db-eqiad: remove dbctl-obsoleted externalLoads section [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558062 (https://phabricator.wikimedia.org/T229686) (owner: 10CDanis) [14:42:50] (03CR) 10CDanis: [C: 03+2] db-eqiad: remove dbctl-obsoleted externalLoads section [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558062 (https://phabricator.wikimedia.org/T229686) (owner: 10CDanis) [14:43:25] 10Operations, 10Core Platform Team, 10WMF-JobQueue: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (10IKhitron) Interactive history shows irrelevant timestamps. 
[14:43:41] marostegui: there's no reason that a mediawiki in eqiad should ever need the ip:port of a mysql in codfw, right? (or vice versa) [14:43:49] (03Merged) 10jenkins-bot: db-eqiad: remove dbctl-obsoleted externalLoads section [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558062 (https://phabricator.wikimedia.org/T229686) (owner: 10CDanis) [14:45:08] !log cdanis@deploy1001 Synchronized wmf-config/db-codfw.php: db-codfw: remove dbctl-obsoleted externalLoads section 519e37461 T229686 (duration: 00m 54s) [14:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:17] T229686: #dbctl: manage 'externalLoads' data - https://phabricator.wikimedia.org/T229686 [14:46:05] 10Operations, 10Core Platform Team, 10WMF-JobQueue: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (10WDoranWMF) @Pchelolo is unfortunately off today and Tuesday of this week. We'll dig into this now and try to find a resolution. [14:46:31] !log cdanis@deploy1001 Synchronized wmf-config/db-eqiad.php: db-eqiad: remove dbctl-obsoleted externalLoads section 5413a6d73 T229686 (duration: 00m 54s) [14:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1121 after schema change', diff saved to https://phabricator.wikimedia.org/P9876 and previous config saved to /var/cache/conftool/dbconfig/20191216-144902-marostegui.json [14:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:42] (03PS3) 10Phamhi: wmcs: monitoring: remove rssh [puppet] - 10https://gerrit.wikimedia.org/r/557103 (https://phabricator.wikimedia.org/T224585) [14:50:32] (03PS2) 10Alexandros Kosiaris: role::etcd::v3::kubernetes: Add role for etcd3 backing k8s [puppet] - 10https://gerrit.wikimedia.org/r/553512 (owner: 10Giuseppe Lavagetto) [14:53:02] 10Operations, 10Traffic: Secure shared ticket key rotation for anycast 
authdns - https://phabricator.wikimedia.org/T240863 (10BBlack) [14:54:48] (03PS1) 10Ottomata: Set g+s on all yarn user cache directories [puppet] - 10https://gerrit.wikimedia.org/r/558067 (https://phabricator.wikimedia.org/T237269) [14:55:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1084 schema change', diff saved to https://phabricator.wikimedia.org/P9877 and previous config saved to /var/cache/conftool/dbconfig/20191216-145520-marostegui.json [14:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:00] (03CR) 10Elukey: [C: 03+1] Set g+s on all yarn user cache directories [puppet] - 10https://gerrit.wikimedia.org/r/558067 (https://phabricator.wikimedia.org/T237269) (owner: 10Ottomata) [14:56:13] (03PS2) 10Ottomata: Set g+s on all yarn user cache directories [puppet] - 10https://gerrit.wikimedia.org/r/558067 (https://phabricator.wikimedia.org/T237269) [14:56:15] (03CR) 10Phamhi: wmcs: monitoring: remove rssh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/557103 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [14:56:33] (03CR) 10Ayounsi: [C: 03+2] "Tested with Ib6fc392d751e8d29b1c488586c83007529e3b524 and works as expected." 
[software/homer] - 10https://gerrit.wikimedia.org/r/550375 (owner: 10Ayounsi) [14:57:10] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Set g+s on all yarn user cache directories [puppet] - 10https://gerrit.wikimedia.org/r/558067 (https://phabricator.wikimedia.org/T237269) (owner: 10Ottomata) [14:58:12] !log ✔️ cdanis@mwdebug2001.codfw.wmnet ~ 🕤☕ scap pull [14:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:09] (03Merged) 10jenkins-bot: netbox: add vlan support [software/homer] - 10https://gerrit.wikimedia.org/r/550375 (owner: 10Ayounsi) [15:00:46] 10Operations, 10Traffic: Create a system for distributed shared secret material to server tmps - https://phabricator.wikimedia.org/T240866 (10BBlack) [15:02:14] 10Operations, 10WMF-JobQueue, 10Core Platform Team Workboards (Clinic Duty Team): Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (10WDoranWMF) [15:04:27] 10Operations, 10User-fgiunchedi: Standardizing our partman recipes - https://phabricator.wikimedia.org/T156955 (10fgiunchedi) [15:13:05] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:14:22] (03PS3) 10Alexandros Kosiaris: role::etcd::v3::kubernetes: Add role for etcd3 backing k8s [puppet] - 10https://gerrit.wikimedia.org/r/553512 (owner: 10Giuseppe Lavagetto) [15:15:55] !log elukey@cumin1001 START - Cookbook sre.druid.roll-restart-workers [15:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:56] 10Operations, 10WMF-JobQueue, 10Core Platform Team Workboards (Clinic Duty Team): Some jobs are not being processed 
/ are processed slowly - https://phabricator.wikimedia.org/T240518 (10dcausse) Looking at the graph "Rate of committed offset increment" from https://grafana.wikimedia.org/d/000000484/kafka-con... [15:20:58] !log mforns@deploy1001 Started deploy [analytics/refinery@1c72a71]: deploying analytics refinery for kerberos migration [15:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:48] (03PS1) 10Ema: ATS: disable compress plugin in text@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/558089 (https://phabricator.wikimedia.org/T238494) [15:26:26] <_joe_> uh what just happened to the appservers [15:27:12] <_joe_> the usual surge in memcached requests is causing this [15:28:30] (03CR) 10Ema: "https://puppet-compiler.wmflabs.org/compiler1002/19990/" [puppet] - 10https://gerrit.wikimedia.org/r/558089 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [15:28:55] !log mforns@deploy1001 Finished deploy [analytics/refinery@1c72a71]: deploying analytics refinery for kerberos migration (duration: 07m 57s) [15:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:34] (03CR) 10Vgutierrez: [C: 03+1] ATS: disable compress plugin in text@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/558089 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [15:29:38] <_joe_> incoming [15:29:39] (03PS2) 10Giuseppe Lavagetto: cxserver: add TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/557846 (https://phabricator.wikimedia.org/T235411) [15:29:41] (03PS2) 10Giuseppe Lavagetto: cxserver: enable TLS in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/557847 (https://phabricator.wikimedia.org/T235411) [15:29:43] (03PS1) 10Giuseppe Lavagetto: citoid: add TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/558091 (https://phabricator.wikimedia.org/T235411) [15:29:45] (03PS1) 10Giuseppe Lavagetto: mathoid: add TLS termination [deployment-charts] - 
10https://gerrit.wikimedia.org/r/558092 (https://phabricator.wikimedia.org/T235411) [15:29:47] (03PS1) 10Giuseppe Lavagetto: termbox: add TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/558093 (https://phabricator.wikimedia.org/T235411) [15:29:49] (03PS1) 10Elukey: Use kerberos to check /mnt/hdfs on stat*/notebook* [puppet] - 10https://gerrit.wikimedia.org/r/558094 [15:30:11] (03CR) 10Ema: [C: 03+2] ATS: allow to configure server_session_sharing.pool [puppet] - 10https://gerrit.wikimedia.org/r/557883 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [15:31:05] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/558094 (owner: 10Elukey) [15:31:59] (03CR) 10Elukey: [C: 03+2] Use kerberos to check /mnt/hdfs on stat*/notebook* [puppet] - 10https://gerrit.wikimedia.org/r/558094 (owner: 10Elukey) [15:37:27] (03CR) 10Nikerabbit: [C: 03+1] ContentTranslation: Make available by default for new Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548730 (https://phabricator.wikimedia.org/T234318) (owner: 10KartikMistry) [15:41:57] !log elukey@cumin1001 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) [15:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:31] (03CR) 10Herron: [C: 03+1] prometheus: bump Logstash Elasticsearch indexing failures thresholds [puppet] - 10https://gerrit.wikimedia.org/r/558043 (https://phabricator.wikimedia.org/T240667) (owner: 10Filippo Giunchedi) [15:46:58] (03CR) 10Giuseppe Lavagetto: [C: 03+2] cxserver: add TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/557846 (https://phabricator.wikimedia.org/T235411) (owner: 10Giuseppe Lavagetto) [15:47:05] (03CR) 10Jhedden: openstack: add ceph rbd support for nova-compute (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/557086 (https://phabricator.wikimedia.org/T239918) (owner: 10Jhedden) [15:47:15] (03Merged) 10jenkins-bot: cxserver: add TLS 
termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/557846 (https://phabricator.wikimedia.org/T235411) (owner: 10Giuseppe Lavagetto) [15:50:55] (03CR) 10Vgutierrez: [C: 03+1] ATS: stop logging BereqURL at the TLS layer too [puppet] - 10https://gerrit.wikimedia.org/r/556142 (https://phabricator.wikimedia.org/T237608) (owner: 10Ema) [15:51:41] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [15:51:59] 10Operations, 10DNS, 10Research, 10Traffic: Add wikiworkshop.org to the Foundation's DNS - https://phabricator.wikimedia.org/T240303 (10leila) @jcrespo makes sense. I'll schedule a meeting for after the holidays. :) (btw, I'm not sure when I can resolve this task. Please go ahead and do it if it's consider... [15:55:41] (03CR) 10Bstorm: toolforge-k8s: enable encryption at rest (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/557144 (https://phabricator.wikimedia.org/T240009) (owner: 10Bstorm) [15:57:05] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [15:57:42] (03PS1) 10TechneSiyam: Added betawikiversity,hiwikibooks,ukwikinews hd logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558103 [15:58:21] We are going to do the Jenkins upgrades [15:58:35] (03PS2) 10Bstorm: toolforge-k8s: enable encryption at rest [puppet] - 10https://gerrit.wikimedia.org/r/557144 (https://phabricator.wikimedia.org/T240009) [16:03:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Change weights from 1 to 100 on x1 slaves in eqiad and codfw - T231018', diff saved to https://phabricator.wikimedia.org/P9880 and previous config saved to /var/cache/conftool/dbconfig/20191216-160346-marostegui.json [16:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:52] T231018: specify group (api/vslow/etc) weights in terms of 0..100 instead of 0..1 - https://phabricator.wikimedia.org/T231018 [16:04:06] 10Operations, 10SRE-Access-Requests: Requesting access to stat1004, stat1007, stat1006, notebook1003, notebook1004 for Kate Zimmerman - https://phabricator.wikimedia.org/T240732 (10Nuria) Approving on my end [16:05:50] !log installing spamassassin security updates [16:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:36] (03CR) 10Alexandros Kosiaris: [C: 03+2] role::etcd::v3::kubernetes: Add role for etcd3 backing k8s [puppet] - 10https://gerrit.wikimedia.org/r/553512 (owner: 10Giuseppe Lavagetto) [16:07:40] (03PS1) 10TechneSiyam: Modified IS.php with betawikiveristy,ukwikines,hiwikibooks under wghdlogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558106 [16:09:46] (03PS2) 10Jon Harald Søby: Remove Wiktionary and Wikiquote from $wgInterlanguageLinkCodeMap for now [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/558053 (https://phabricator.wikimedia.org/T174160) [16:12:56] !log elukey@cumin1001 START - Cookbook sre.druid.roll-restart-workers [16:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:59] !log Upgrading https://releases-jenkins.wikimedia.org/ [16:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:14] !log Restarting CI Jenkins [16:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:21] (03Abandoned) 10Alexandros Kosiaris: k8s: Add roles to new etcd hosts [puppet] - 10https://gerrit.wikimedia.org/r/556324 (https://phabricator.wikimedia.org/T239838) (owner: 10Alexandros Kosiaris) [16:18:51] 10Operations, 10WMF-JobQueue, 10Core Platform Team Workboards (Clinic Duty Team): Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (10Mholloway) [16:19:46] 10Operations, 10WMF-JobQueue, 10Core Platform Team Workboards (Clinic Duty Team): Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (10Mholloway) I am about to deploy a backported fix for `fetchGoogleCloudVisionAnnotations` to wmf.10. 
[16:19:49] 10Operations, 10Release Pipeline, 10serviceops, 10Goal, 10Release-Engineering-Team (Pipeline): Self-service Deployment Pipeline - https://phabricator.wikimedia.org/T228676 (10akosiaris) [16:19:56] 10Operations, 10Release Pipeline, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 4 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (10akosiaris) 05Open→03Stalled [16:19:59] 10Operations, 10Release Pipeline, 10Release-Engineering-Team-TODO, 10Core Platform Team Legacy (Watching / External), and 3 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10akosiaris) [16:22:41] 10Operations, 10WMF-JobQueue, 10Core Platform Team Workboards (Clinic Duty Team): Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (10Urbanecm) >>! In T240518#5744203, @Ymblanter wrote: > Are failing uploads on Commons the same issue? >>! In T240518#574421... 
[16:23:37] (03PS1) 10Ottomata: Change eventgate-logging-external TLS port to 4392 [deployment-charts] - 10https://gerrit.wikimedia.org/r/558115 [16:24:32] !log mholloway-shell@deploy1001 Synchronized php-1.35.0-wmf.10/extensions/MachineVision: Fix: Bail out of label fetching job if local file not found (T240733) (duration: 00m 59s) [16:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:37] T240733: TypeError from line 92 of /srv/mediawiki/php-1.35.0-wmf.10/extensions/MachineVision/src/Client/GoogleCloudVisionClient.php: Argument 2 passed to MediaWiki\Extension\MachineVision\Client\GoogleCloudVisionClient::fetchAnnotations() must be an instance of LocalFile, boolean given - https://phabricator.wikimedia.org/T240733 [16:25:21] (03PS1) 10Ottomata: Change eventgate-logging-external TLS port to 4392 [puppet] - 10https://gerrit.wikimedia.org/r/558117 [16:25:37] 10Operations, 10Commons, 10MediaWiki-extensions-PagedTiffHandler, 10Multimedia, and 2 others: Large TIFF files do not pass file verification - https://phabricator.wikimedia.org/T240455 (10Bawolff) I can also confirm if i compile from (source `Version: ImageMagick 7.0.9-8 Q16 x86_64 2019-12-16`) that identi... 
[16:25:50] 10Operations, 10Commons, 10MediaWiki-extensions-PagedTiffHandler, 10Multimedia, and 2 others: Large TIFF files do not pass file verification (related to version of image magick installed) - https://phabricator.wikimedia.org/T240455 (10Bawolff) [16:27:22] !log Jenkins CI: upgrading collapsing console section to 1.8.0 # T236222 / T239985 [16:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:28] T239985: Upgrade Jenkins to 2.190.3 - https://phabricator.wikimedia.org/T239985 [16:28:06] (03PS1) 10Jbond: puppet-merge test: test puppet-merge deployed to one puppet master [puppet] - 10https://gerrit.wikimedia.org/r/558118 [16:28:20] (03CR) 10Jbond: [V: 03+2 C: 03+2] puppet-merge test: test puppet-merge deployed to one puppet master [puppet] - 10https://gerrit.wikimedia.org/r/558118 (owner: 10Jbond) [16:36:23] (03PS6) 10Ottomata: Allow labstore hosts to contact Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/546189 (https://phabricator.wikimedia.org/T234229) [16:36:59] (03PS1) 10Alexandros Kosiaris: Temporarily ship k8s3 certs [puppet] - 10https://gerrit.wikimedia.org/r/558119 (https://phabricator.wikimedia.org/T239838) [16:37:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Change weights from 1 to 100 on es1 slaves in eqiad and codfw - T231018', diff saved to https://phabricator.wikimedia.org/P9881 and previous config saved to /var/cache/conftool/dbconfig/20191216-163712-marostegui.json [16:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:19] T231018: specify group (api/vslow/etc) weights in terms of 0..100 instead of 0..1 - https://phabricator.wikimedia.org/T231018 [16:41:04] (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/19994/" [puppet] - 10https://gerrit.wikimedia.org/r/546189 (https://phabricator.wikimedia.org/T234229) (owner: 10Ottomata) [16:41:15] (03PS10) 10CRusnov: netbox: Add automation git machinery [puppet] - 
10https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) [16:41:19] (03PS1) 10Jbond: test commit [labs/private] - 10https://gerrit.wikimedia.org/r/558120 [16:41:45] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: bump Logstash Elasticsearch indexing failures thresholds [puppet] - 10https://gerrit.wikimedia.org/r/558043 (https://phabricator.wikimedia.org/T240667) (owner: 10Filippo Giunchedi) [16:42:05] !log elukey@cumin1001 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) [16:42:09] (03CR) 10CRusnov: "> Patch Set 3: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/550053 (owner: 10CRusnov) [16:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:20] 10Operations, 10SRE-Access-Requests: Requesting access to stat1004, stat1007, stat1006, notebook1003, notebook1004 for Kate Zimmerman - https://phabricator.wikimedia.org/T240732 (10jcrespo) This should be now only pending on 3-business day wait, although I will ask some analytics sres a doubt I have, meanwhile. [16:42:31] 10Operations, 10SRE-Access-Requests: Requesting access to stat1004, stat1007, stat1006, notebook1003, notebook1004 for Kate Zimmerman - https://phabricator.wikimedia.org/T240732 (10jcrespo) a:05Nuria→03None [16:45:11] !log elukey@cumin1001 START - Cookbook sre.druid.roll-restart-workers [16:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:20] 10Operations: Audit the WMF LDAP group and limit its permissions - https://phabricator.wikimedia.org/T240870 (10colewhite) [16:45:23] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 3 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [16:45:23] (03CR) 10Jbond: [V: 03+2 C: 03+2] test commit [labs/private] - 10https://gerrit.wikimedia.org/r/558120 (owner: 10Jbond) [16:45:34] 10Operations: Audit the WMF LDAP group and limit its permissions - https://phabricator.wikimedia.org/T240870 (10colewhite) p:05Triage→03Low [16:47:13] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [16:53:55] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2048 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:54:55] PROBLEM - Check systemd state on ms-be2048 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:56:00] (03PS3) 10CRusnov: netbox: move to netbox-extras repository [puppet] - 10https://gerrit.wikimedia.org/r/554962 [16:57:45] (03CR) 10Volans: "Some minor comments inline." 
(0311 comments) [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/556732 (owner: 10Ssingh) [16:58:59] (03CR) 10CRusnov: [C: 03+2] netbox: move to netbox-extras repository [puppet] - 10https://gerrit.wikimedia.org/r/554962 (owner: 10CRusnov) [17:01:27] !log Restarting CI Jenkins for plugins updates [17:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:36] !log start batch indexing of minwiktionary into cirrussearch [17:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:07] 10Operations, 10WMF-JobQueue, 10Core Platform Team Workboards (Clinic Duty Team): Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (10Mholloway) [17:04:30] (03PS1) 10Elukey: Add hive settings for Kerberos in Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/558127 [17:05:25] (03CR) 10Ottomata: [C: 03+1] Add hive settings for Kerberos in Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/558127 (owner: 10Elukey) [17:05:40] (03CR) 10Elukey: [C: 03+2] Add hive settings for Kerberos in Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/558127 (owner: 10Elukey) [17:06:33] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1123 - https://phabricator.wikimedia.org/T240534 (10Marostegui) >>! In T240534#5736762, @wiki_willy wrote: > @Jclark-ctr - looks like this one is still under warranty, so you should be able to just RMA it. Thanks, Willy If we could try to RMA it today, a... [17:06:41] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:06:58] 10Operations, 10observability: Clean up ORES metrics - https://phabricator.wikimedia.org/T238807 (10colewhite) 05Open→03Resolved [17:08:29] (03PS1) 10Ottomata: Put hive-site.xml into HDFS as analytics user [puppet] - 10https://gerrit.wikimedia.org/r/558128 (https://phabricator.wikimedia.org/T237269) [17:09:54] 10Operations, 10Traffic: Implement DNS-over-TLS for AuthDNS - https://phabricator.wikimedia.org/T239994 (10BBlack) Refactoring the dependencies a little here: Really (2) above's sub-point about shared ticket key rotation won't matter until we're anycasting, so I've made a separate task (+subtask) in T240863 to... [17:10:06] (03PS2) 10Ottomata: Put hive-site.xml into HDFS as analytics user [puppet] - 10https://gerrit.wikimedia.org/r/558128 (https://phabricator.wikimedia.org/T237269) [17:10:49] (03CR) 10Ottomata: [C: 03+2] Put hive-site.xml into HDFS as analytics user [puppet] - 10https://gerrit.wikimedia.org/r/558128 (https://phabricator.wikimedia.org/T237269) (owner: 10Ottomata) [17:10:52] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Put hive-site.xml into HDFS as analytics user [puppet] - 10https://gerrit.wikimedia.org/r/558128 (https://phabricator.wikimedia.org/T237269) (owner: 10Ottomata) [17:12:36] (03PS3) 10Ottomata: Put hive-site.xml into HDFS as analytics user [puppet] - 10https://gerrit.wikimedia.org/r/558128 (https://phabricator.wikimedia.org/T237269) [17:12:58] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Put hive-site.xml into HDFS as analytics user [puppet] - 10https://gerrit.wikimedia.org/r/558128 (https://phabricator.wikimedia.org/T237269) (owner: 10Ottomata) [17:14:17] !log elukey@cumin1001 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) [17:14:21] PROBLEM - Logstash 
Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.6 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [17:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:23] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)8 ge (W)1 ge 0.2167 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [17:19:57] 10Operations, 10WMF-JobQueue, 10Core Platform Team Workboards (Clinic Duty Team): Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (10Joe) I see sometimes an error like the following: ` [2019-12-16T17:14:19.865Z] ERROR: cpjobqueue/8023 on scb1001: Producer... [17:22:10] !log anomie@deploy1001 Synchronized php-1.35.0-wmf.10/includes/api/ApiQueryUserContribs.php: Backporting fix for T240808 (duration: 00m 59s) [17:22:16] (03PS1) 10Jbond: stunnel: add stunnel module [puppet] - 10https://gerrit.wikimedia.org/r/558133 [17:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:18] T240808: Usercontribs API returning results for a different user - https://phabricator.wikimedia.org/T240808 [17:22:40] <_joe_> mdholloway: I see Query: INSERT INTO `machine_vision_image` (mvi_sha1,mvi_rand) VALUES ('tihdeocsuykui8b4cqiq36hauptwnsl','0.8776146945905') [17:22:41] <_joe_> Function: MediaWiki\Extension\MachineVision\Repository::insertLabels [17:22:43] <_joe_> Error: 1062 Duplicate entry ... 
for key 'mvi_sha1' [17:22:50] <_joe_> err sorry [17:25:21] <_joe_> mdholloway: so I see still errors from the google cloud vision thing [17:25:46] (03PS2) 10Alexandros Kosiaris: Temporarily ship k8s3 certs [puppet] - 10https://gerrit.wikimedia.org/r/558119 (https://phabricator.wikimedia.org/T239838) [17:27:17] (03PS1) 10BryanDavis: cloud: add ::apparmor dependency to ::lxc for Buster [puppet] - 10https://gerrit.wikimedia.org/r/558136 (https://phabricator.wikimedia.org/T240875) [17:27:35] (03CR) 10Elukey: [C: 03+2] Enable kerberos on labstore nodes [puppet] - 10https://gerrit.wikimedia.org/r/557084 (https://phabricator.wikimedia.org/T234229) (owner: 10Elukey) [17:27:50] (03CR) 10Elukey: Enable kerberos on labstore nodes [puppet] - 10https://gerrit.wikimedia.org/r/557084 (https://phabricator.wikimedia.org/T234229) (owner: 10Elukey) [17:27:59] (03CR) 10jerkins-bot: [V: 04-1] cloud: add ::apparmor dependency to ::lxc for Buster [puppet] - 10https://gerrit.wikimedia.org/r/558136 (https://phabricator.wikimedia.org/T240875) (owner: 10BryanDavis) [17:28:09] (03PS2) 10Elukey: Move Analytics systemd timers on labstore nodes to local /mnt/hdfs [puppet] - 10https://gerrit.wikimedia.org/r/557083 (https://phabricator.wikimedia.org/T234229) [17:28:16] _joe_: how frequently are you seeing duplicate entry errors into machine_vision_image? that's something that's foreseeable under the right circumstances, but should be rare and shouldn't be happening at all right now. (the expectation would be that it would happen if we have two or more image labeling providers enabled, which isn't currently the case.) 
[17:28:17] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:28:53] on the PHP side, we ignore the error, re-query for the existing row ID, and carry on [17:29:18] <_joe_> oh ok [17:29:28] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO, 10SRE-Access-Requests: Grant "contint-roots" and "releasers-mediawiki" to user brennen - https://phabricator.wikimedia.org/T240382 (10jcrespo) I mistakenly assumed I needed SRE agreement because of root, was reminded I... [17:30:14] (03CR) 10Elukey: [C: 03+2] Move Analytics systemd timers on labstore nodes to local /mnt/hdfs [puppet] - 10https://gerrit.wikimedia.org/r/557083 (https://phabricator.wikimedia.org/T234229) (owner: 10Elukey) [17:30:18] i suppose it could also happen in the case of a duplicate image being uploaded, since we identify images internally by SHA1 digest. 
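Editorial sketch of the pattern described above ("ignore the error, re-query for the existing row ID, and carry on"). This is not the MachineVision extension's PHP code. It is a minimal Python illustration that uses an in-memory SQLite database in place of MySQL. The table and the `mvi_sha1` / `mvi_rand` columns come from the query quoted in the log; the `mvi_id` column, the function name, and the sample digest are assumptions made for the example.

```python
import sqlite3


def get_or_create_image_id(conn, sha1, rand):
    """Insert a row keyed by SHA-1; on a duplicate-key error, reuse the existing row."""
    try:
        cur = conn.execute(
            "INSERT INTO machine_vision_image (mvi_sha1, mvi_rand) VALUES (?, ?)",
            (sha1, rand),
        )
        conn.commit()
        return cur.lastrowid
    except sqlite3.IntegrityError:
        # Unique-key violation on mvi_sha1 (the SQLite analogue of MySQL error 1062):
        # ignore it, then look up the ID of the row that already exists.
        conn.rollback()
        row = conn.execute(
            "SELECT mvi_id FROM machine_vision_image WHERE mvi_sha1 = ?",
            (sha1,),
        ).fetchone()
        return row[0]


# Hypothetical schema: mvi_id is an assumed primary key, not taken from the log.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE machine_vision_image ("
    "mvi_id INTEGER PRIMARY KEY, "
    "mvi_sha1 TEXT UNIQUE NOT NULL, "
    "mvi_rand REAL)"
)

first_id = get_or_create_image_id(conn, "abc123", 0.5)
second_id = get_or_create_image_id(conn, "abc123", 0.9)  # duplicate digest
row_count = conn.execute("SELECT COUNT(*) FROM machine_vision_image").fetchone()[0]
```

With this approach a second insert for the same digest returns the original row's ID and leaves a single row in the table, so a duplicate upload does not fail the job, provided the caller treats the duplicate-key error as expected.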
[17:30:31] (03PS2) 10Elukey: Enable kerberos on labstore nodes [puppet] - 10https://gerrit.wikimedia.org/r/557084 (https://phabricator.wikimedia.org/T234229) [17:31:39] <_joe_> mdholloway: I see a 500 server error though [17:31:49] <_joe_> in the kibana dashboard for jobqueue eventbus [17:32:04] looking [17:32:11] (03PS2) 10BryanDavis: cloud: add ::apparmor dependency to ::profile::wmcs::mediawiki_vagrant [puppet] - 10https://gerrit.wikimedia.org/r/558136 (https://phabricator.wikimedia.org/T240875) [17:32:49] (03CR) 10Elukey: [C: 03+2] Enable kerberos on labstore nodes [puppet] - 10https://gerrit.wikimedia.org/r/557084 (https://phabricator.wikimedia.org/T234229) (owner: 10Elukey) [17:33:59] (03CR) 10jerkins-bot: [V: 04-1] cloud: add ::apparmor dependency to ::profile::wmcs::mediawiki_vagrant [puppet] - 10https://gerrit.wikimedia.org/r/558136 (https://phabricator.wikimedia.org/T240875) (owner: 10BryanDavis) [17:34:19] (03PS3) 10Alexandros Kosiaris: Temporarily ship k8s3 certs [puppet] - 10https://gerrit.wikimedia.org/r/558119 (https://phabricator.wikimedia.org/T239838) [17:36:12] argh. the style guide is killing me. [17:36:20] * bd808 tries to remember how to puppet [17:36:23] (03PS1) 10Elukey: role::dumps::distribution::server: fix include [puppet] - 10https://gerrit.wikimedia.org/r/558137 [17:36:54] (03CR) 10Alexandros Kosiaris: [C: 03+2] Temporarily ship k8s3 certs [puppet] - 10https://gerrit.wikimedia.org/r/558119 (https://phabricator.wikimedia.org/T239838) (owner: 10Alexandros Kosiaris) [17:38:17] (03PS3) 10BryanDavis: cloud: add ::apparmor dependency to ::profile::wmcs::mediawiki_vagrant [puppet] - 10https://gerrit.wikimedia.org/r/558136 (https://phabricator.wikimedia.org/T240875) [17:38:19] (03CR) 10Elukey: [C: 03+2] role::dumps::distribution::server: fix include [puppet] - 10https://gerrit.wikimedia.org/r/558137 (owner: 10Elukey) [17:38:42] _joe_: lots of duplicates, too, which needs investigation. 
i'm going to disable enqueuing new labeling request jobs while we investigate this. [17:39:11] (03PS1) 10Ema: vhtcpd: convert service to systemd [puppet] - 10https://gerrit.wikimedia.org/r/558138 (https://phabricator.wikimedia.org/T240826) [17:39:24] <_joe_> mdholloway: thanks [17:40:07] PROBLEM - Etcd cluster health on kubetcd1004 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [17:40:59] ^^ expected? [17:42:49] (03PS1) 10Mholloway: MachineVision: Disable enqueuing new labeling request jobs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558141 (https://phabricator.wikimedia.org/T240518) [17:42:58] (03PS1) 10Elukey: role::dumps::distribution::server: add Hadoop-related hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/558142 [17:44:18] (03CR) 10Mholloway: [V: 03+2 C: 03+2] MachineVision: Disable enqueuing new labeling request jobs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558141 (https://phabricator.wikimedia.org/T240518) (owner: 10Mholloway) [17:44:38] (03PS2) 10Ema: vhtcpd: convert service to systemd [puppet] - 10https://gerrit.wikimedia.org/r/558138 (https://phabricator.wikimedia.org/T240826) [17:44:41] (03CR) 10Elukey: [C: 03+2] role::dumps::distribution::server: add Hadoop-related hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/558142 (owner: 10Elukey) [17:45:25] 10Operations, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: hw troubleshooting: hardware RAID predictive failure for bellatrix.frack.codfw.wmnet - https://phabricator.wikimedia.org/T240876 (10Jgreen) [17:45:39] 10Operations, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: hw troubleshooting: hardware RAID predictive failure for bellatrix.frack.codfw.wmnet - https://phabricator.wikimedia.org/T240876 (10Jgreen) a:03Papaul [17:45:53] PROBLEM - Check systemd state on kubestagetcd1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:46:06] thanks are fine ^ [17:46:07] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:46:14] those are fine ^, it's a new service [17:46:33] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: (no justification provided) (duration: 00m 56s) [17:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:38] 10Operations, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: hw troubleshooting: hardware RAID predictive failure for bellatrix.frack.codfw.wmnet - https://phabricator.wikimedia.org/T240876 (10Jgreen) Please note this server is still in service. [17:46:41] PROBLEM - Check systemd state on kubestagetcd1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:46:54] !log disabled enqueuing new MachineVision label request jobs [17:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:57] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2048 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:49:32] (03PS2) 10Jbond: stunnel: add stunnel module (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/558133 [17:49:34] (03PS1) 10Elukey: Avoid to deploy spark2 on labstore nodes [puppet] - 10https://gerrit.wikimedia.org/r/558144 [17:50:39] (03PS3) 10Ema: vhtcpd: convert service to systemd [puppet] - 10https://gerrit.wikimedia.org/r/558138 (https://phabricator.wikimedia.org/T240826) [17:51:28] (03CR) 10jerkins-bot: [V: 04-1] stunnel: add stunnel module (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/558133 (owner: 10Jbond) [17:52:05] PROBLEM - Etcd cluster health on kubestagetcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [17:53:18] (03PS2) 10Elukey: Avoid to deploy spark2 on labstore nodes [puppet] - 10https://gerrit.wikimedia.org/r/558144 [17:55:24] (03CR) 10Elukey: [C: 03+2] Avoid to deploy spark2 on labstore nodes [puppet] - 10https://gerrit.wikimedia.org/r/558144 (owner: 10Elukey) [17:56:03] !log mholloway-shell@deploy1001 Synchronized php-1.35.0-wmf.10/extensions/MachineVision: Fix: Ignore duplicate entry errors on insertLabels (T240518) (duration: 00m 57s) [17:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:09] T240518: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 [17:56:39] RECOVERY - Check systemd state on ms-be2048 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:58:43] PROBLEM - Check systemd state on kubestagetcd1004 
is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:00:04] gehel and onimisionipe: How many deployers does it take to do Wikidata Query Service weekly deploy deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191216T1800). [18:00:14] jouncebot: on it! [18:01:23] PROBLEM - etcd service on kubestagetcd1005 is CRITICAL: CRITICAL - Expecting active but unit etcd is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:03:09] !log mobrovac@deploy1001 Started restart [cpjobqueue/deploy@deafe56]: Rolling restart of CP4JQ -- T240518 [18:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:14] T240518: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 [18:04:09] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@665d9d3]: New WDQS build [18:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:30] (03PS6) 10Phamhi: cloudvps: rename+reimage labmon1001 as cloudmetrics1001 [dns] - 10https://gerrit.wikimedia.org/r/555570 (https://phabricator.wikimedia.org/T224585) [18:08:21] PROBLEM - Etcd cluster health on kubetcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [18:08:21] PROBLEM - Etcd cluster health on kubestagetcd1004 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [18:08:23] PROBLEM - Etcd cluster health on kubetcd2005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [18:08:34] (03CR) 10Phamhi: cloudvps: rename+reimage labmon1001 as cloudmetrics1001 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/555570 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [18:08:37] <_joe_> akosiaris: need any help? 
[18:08:55] (03PS5) 10Phamhi: cloudvps: rename+reimage labmon1001 as cloudmetrics1001 [puppet] - 10https://gerrit.wikimedia.org/r/555565 (https://phabricator.wikimedia.org/T224585) [18:10:58] <_joe_> akosiaris: http vs https is your problem in the peer urls [18:11:07] _joe_: yeah I noticed it [18:11:12] and it's the DNS [18:11:34] (03PS1) 10Jcrespo: admin: Add brennen to contint-roots and releasers-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/558151 (https://phabricator.wikimedia.org/T240382) [18:11:39] turns out I need to add -ssl [18:11:49] <_joe_> yes, damn etcd conventions :P [18:12:57] PROBLEM - Etcd cluster health on kubestagetcd1006 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [18:12:57] PROBLEM - etcd service on kubestagetcd1004 is CRITICAL: CRITICAL - Expecting active but unit etcd is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:13:43] (03CR) 10Jcrespo: [C: 03+2] admin: Add brennen to contint-roots and releasers-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/558151 (https://phabricator.wikimedia.org/T240382) (owner: 10Jcrespo) [18:13:46] (03PS1) 10Alexandros Kosiaris: k8s: make etcd3 SRV DNS records TLS specific [dns] - 10https://gerrit.wikimedia.org/r/558152 (https://phabricator.wikimedia.org/T239838) [18:14:27] (03CR) 10Alexandros Kosiaris: [C: 03+2] k8s: make etcd3 SRV DNS records TLS specific [dns] - 10https://gerrit.wikimedia.org/r/558152 (https://phabricator.wikimedia.org/T239838) (owner: 10Alexandros Kosiaris) [18:16:09] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO, 10SRE-Access-Requests, 10Patch-For-Review: Grant "contint-roots" and "releasers-mediawiki" to user brennen - https://phabricator.wikimedia.org/T240382 (10jcrespo) @brennen Change is deployed, please allow up to 30 min... 
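Note on the etcd exchange above (http vs https in the peer URLs, "need to add -ssl", and the follow-up DNS change making the SRV records TLS specific): as far as I understand etcd's DNS discovery, a TLS cluster is looked up under the `_etcd-server-ssl._tcp` SRV name and a plaintext one under `_etcd-server._tcp`, and the peer URL scheme has to agree with that choice. The sketch below only illustrates that naming rule. The helper names, the domain `example.org`, and the host names are hypothetical and do not come from the log or from the actual DNS patch.

```python
from urllib.parse import urlparse


def srv_record_name(domain: str, tls: bool) -> str:
    """Return the SRV name etcd DNS discovery is assumed to look up."""
    service = "_etcd-server-ssl" if tls else "_etcd-server"
    return f"{service}._tcp.{domain}"


def peer_url(host: str, port: int, tls: bool) -> str:
    """Build a peer URL whose scheme matches the TLS setting."""
    scheme = "https" if tls else "http"
    return f"{scheme}://{host}:{port}"


def scheme_mismatches(peer_urls, tls: bool):
    """List the peer URLs whose scheme disagrees with the TLS setting."""
    expected = "https" if tls else "http"
    return [url for url in peer_urls if urlparse(url).scheme != expected]


if __name__ == "__main__":
    # Hypothetical hosts; 2380 is etcd's conventional peer port.
    urls = [
        peer_url("etcd1.example.org", 2380, tls=True),
        "http://etcd2.example.org:2380",
    ]
    print(srv_record_name("example.org", tls=True))
    print(scheme_mismatches(urls, tls=True))
```

A check like `scheme_mismatches` is the kind of thing that would flag the http/https disagreement _joe_ points out, before the members fail to join.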
[18:17:02] RECOVERY - Check systemd state on kubestagetcd1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:17:02] PROBLEM - etcd service on kubestagetcd1006 is CRITICAL: CRITICAL - Expecting active but unit etcd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:17:32] RECOVERY - Check systemd state on kubestagetcd1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:17:53] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@665d9d3]: New WDQS build (duration: 13m 44s) [18:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:18] RECOVERY - Etcd cluster health on kubestagetcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [18:18:38] RECOVERY - etcd service on kubestagetcd1004 is OK: OK - etcd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:18:38] RECOVERY - Etcd cluster health on kubestagetcd1006 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [18:19:04] RECOVERY - etcd service on kubestagetcd1006 is OK: OK - etcd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:19:04] RECOVERY - Check systemd state on kubestagetcd1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:19:41] mdholloway: it seems machinevision problems are still happening, e.g. 
https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-mediawiki-2019.12.16/mediawiki?id=AW8P63WXKWrIH1QRuRDL&_g=h@c411f17 [18:19:46] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@665d9d3]: New WDQS build [18:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:52] PROBLEM - Etcd cluster health on kubetcd1006 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [18:19:54] PROBLEM - Etcd cluster health on kubetcd2004 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [18:19:54] PROBLEM - Etcd cluster health on kubetcd2006 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [18:20:10] !log mholloway-shell@deploy1001 Synchronized php-1.35.0-wmf.10/extensions/MachineVision: (T240518) (duration: 00m 57s) [18:20:12] mobrovac: yes, deploying another patch now which should stop the retries [18:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:15] T240518: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 [18:21:00] kk thnx mdholloway [18:21:06] (03PS4) 10Ssingh: Add script for fetching routing information from RIPEstat [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/556732 [18:21:24] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@665d9d3]: New WDQS build (duration: 01m 38s) [18:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:30] RECOVERY - Etcd cluster health on kubestagetcd1004 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [18:23:52] 10Operations, 10ops-eqiad, 10Wikimedia-Logstash: rack/setup/install logstash102[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T240881 (10RobH) p:05Triage→03Normal [18:23:57] 10Operations, 10ops-codfw, 10Wikimedia-Logstash: rack/setup/install logstash202[6-9].codfw.wmnet - 
https://phabricator.wikimedia.org/T240882 (10RobH) p:05Triage→03Normal [18:24:06] 10Operations, 10ops-eqiad, 10Wikimedia-Logstash: rack/setup/install logstash102[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T240881 (10RobH) [18:24:10] RECOVERY - Etcd cluster health on kubetcd1004 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [18:24:14] 10Operations, 10ops-codfw, 10Wikimedia-Logstash: rack/setup/install logstash202[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T240882 (10RobH) [18:24:22] RECOVERY - Etcd cluster health on kubetcd1006 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [18:24:24] RECOVERY - Etcd cluster health on kubetcd2006 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [18:24:24] RECOVERY - Etcd cluster health on kubetcd2004 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [18:24:27] (03CR) 10Ssingh: Add script for fetching routing information from RIPEstat (0310 comments) [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/556732 (owner: 10Ssingh) [18:24:46] RECOVERY - Etcd cluster health on kubetcd2005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [18:25:14] RECOVERY - Etcd cluster health on kubetcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [18:25:14] RECOVERY - etcd service on kubestagetcd1005 is OK: OK - etcd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:26:04] mobrovac: so basically s4/commons processing has halted completely for 5 days, and other wikis unaffected? [18:26:09] mobrovac: one more patch needed. 
[18:26:40] 10Operations, 10ops-eqiad, 10Wikimedia-Logstash: rack/setup/install logstash102[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T240881 (10RobH) [18:28:21] !log mobrovac@deploy1001 Started restart [cpjobqueue/deploy@deafe56]: (no justification provided) [18:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:48] (03PS1) 10Effie Mouzeli: mediawiki::php::admin fix lib.php [puppet] - 10https://gerrit.wikimedia.org/r/558158 (https://phabricator.wikimedia.org/T240824) [18:28:57] (03PS3) 10Bstorm: toolforge-k8s: enable encryption at rest [puppet] - 10https://gerrit.wikimedia.org/r/557144 (https://phabricator.wikimedia.org/T240009) [18:29:04] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@665d9d3]: New WDQS build [18:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:06] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@665d9d3]: New WDQS build (duration: 01m 01s) [18:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:26] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@665d9d3]: New WDQS build [18:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:19] 10Operations, 10SRE-Access-Requests: Requesting access to stat1004, stat1007, stat1006, notebook1003, notebook1004 for Kate Zimmerman - https://phabricator.wikimedia.org/T240732 (10jcrespo) @kzimmerman One small correction, to make sure the request is correct, you asked for `analtyics-wmde`, we can grant `anal... 
[18:31:20] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@665d9d3]: New WDQS build (duration: 00m 53s) [18:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:42] (03CR) 10Bstorm: [C: 03+2] toolforge-k8s: enable encryption at rest [puppet] - 10https://gerrit.wikimedia.org/r/557144 (https://phabricator.wikimedia.org/T240009) (owner: 10Bstorm) [18:33:07] !log mholloway-shell@deploy1001 Synchronized php-1.35.0-wmf.10/extensions/MachineVision: Catch DB duplicate key errors, cont. (T240518) (duration: 00m 55s) [18:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:12] T240518: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 [18:37:05] Krinkle: it's job-based, not wiki-based [18:37:57] mobrovac: hm.. so MachineVersion issues are not behind recentChangeUpdate building up? [18:38:10] *Vision [18:38:36] I thought maybe the type error broke something early on in job init logic [18:39:24] we'll know for sure once that is completely fixed [18:39:39] (03PS1) 10Elukey: role::dumps::distribution::server: simplify hdfs mountpoint code [puppet] - 10https://gerrit.wikimedia.org/r/558165 [18:39:45] but it's not likely as recentchange jobs are being accumulated no matter what [18:40:56] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/20002/labstore1006.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/558165 (owner: 10Elukey) [18:45:44] mobrovac: OK, looks like the cpjobqueue errors are squashed now that we're catching the underlying error in PHP. One thing that concerns me, though, is that the job insertion rate for fetchGoogleCloudVisionAnnotations hasn't slowed even though I disabled that an hour ago. Is this because we're using `jobReleaseTimestamp`? How is this implemented? Are jobs actually inserted into the queue immediately, or after the [18:45:44] jobReleaseTimestamp has passed?
[18:46:58] (03CR) 10Bstorm: toolforge-k8s: Add a service and notify around kubelet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/557071 (owner: 10Bstorm) [18:47:06] (03PS2) 10Bstorm: toolforge-k8s: Add a service and notify around kubelet [puppet] - 10https://gerrit.wikimedia.org/r/557071 [18:47:31] If the latter, then I'm afraid that we won't see the effect of disabling new jobs for another 48h, at least without some additional intervention to kill the pending jobs. [18:48:03] (03CR) 10Bstorm: [C: 03+2] toolforge-k8s: Add a service and notify around kubelet [puppet] - 10https://gerrit.wikimedia.org/r/557071 (owner: 10Bstorm) [18:50:08] mdholloway: jobs get created during the request and then dispatched after the output has been flushed back to the client, so i don't see how it is possible that they get created if they are disabled [18:50:17] mdholloway: are they being created recursively? [18:53:36] No, no recursion. The only place they should be created is in an UploadComplete hook handler, after checking that `$wgMachineVisionRequestLabelsOnUploadComplete` is true. I double-checked that it's false on commonswiki in shell.php, and it is, so I'm not sure what's going on, either. [18:53:59] there is a backlog of them [18:54:01] It is still true on testcommonswiki, but that should be a negligible number. [18:54:05] so they need to get processed [18:54:05] Aha. [18:54:14] !log mobrovac@deploy1001 Started restart [cpjobqueue/deploy@deafe56]: (no justification provided) [18:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:39] 10Operations: Audit the WMF LDAP group and limit its permissions - https://phabricator.wikimedia.org/T240870 (10jcrespo) "all employees should have" It was agreed some time ago that not all employees were to be added to the wmf group. The only requirements were to: 1) Request it 2) Have a reasonable reason for... 
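Note on the job-queue exchange between mdholloway and mobrovac above: the conclusion reached is that a backlog of already-created jobs still has to be processed after new enqueuing is switched off. The toy model below illustrates only that general behaviour. It assumes jobs are stored at enqueue time together with a release time and are handed out once that time has passed. It is not the MediaWiki or cpjobqueue implementation, and `ToyQueue`, `DelayedJob`, `enqueue_enabled`, and `release_at` are hypothetical names.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class DelayedJob:
    release_at: int                   # earliest time (arbitrary units) the job may run
    name: str = field(compare=False)  # not used for ordering


class ToyQueue:
    """Toy delayed-job queue: disabling enqueue does not drain the backlog."""

    def __init__(self):
        self._heap = []
        self.enqueue_enabled = True

    def enqueue(self, name: str, release_at: int) -> bool:
        """Store a job; return False if enqueuing has been disabled."""
        if not self.enqueue_enabled:
            return False
        heapq.heappush(self._heap, DelayedJob(release_at, name))
        return True

    def run_due(self, now: int):
        """Pop and return the names of all jobs whose release time has passed."""
        done = []
        while self._heap and self._heap[0].release_at <= now:
            done.append(heapq.heappop(self._heap).name)
        return done

    def pending(self) -> int:
        return len(self._heap)


if __name__ == "__main__":
    queue = ToyQueue()
    queue.enqueue("label-request-a", release_at=48)
    queue.enqueue("label-request-b", release_at=24)
    queue.enqueue_enabled = False     # analogous to turning off new label requests
    print(queue.enqueue("label-request-c", release_at=72))
    print(queue.pending())
    print(queue.run_due(now=24), queue.run_due(now=48))
```

In this model the only ways to shrink the backlog are to wait for the release times or to stop consuming the jobs altogether, which matches the two options raised in the discussion.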
[18:56:43] (03PS1) 10Ottomata: otto .liquidpromptrc setting - show time [puppet] - 10https://gerrit.wikimedia.org/r/558168 [18:58:45] (03CR) 10Ottomata: [C: 03+2] otto .liquidpromptrc setting - show time [puppet] - 10https://gerrit.wikimedia.org/r/558168 (owner: 10Ottomata) [18:58:49] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@665d9d3]: New WDQS build - redeploy to fix issue on wdqs1007 [18:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:05] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Morning SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191216T1900). [19:00:05] matthiasmullie and Jhs: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:15] * Jhs is present [19:00:29] o/ [19:00:43] Jhs: my patch will take ~half hour to land - feel free to go first! [19:00:58] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@665d9d3]: New WDQS build - redeploy to fix issue on wdqs1007 (duration: 02m 09s) [19:01:01] mdholloway: you are done with your deployment(s)? [19:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:12] matthiasmullie, okies 👍 :) [19:01:12] Urbanecm: yes, done, thanks! [19:01:29] (03CR) 10Urbanecm: [C: 03+2] Remove Wiktionary and Wikiquote from $wgInterlanguageLinkCodeMap for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558053 (https://phabricator.wikimedia.org/T174160) (owner: 10Jon Harald Søby) [19:01:34] thanks mdholloway [19:04:42] !log depool maps2002 for postgres init - T239728 [19:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:48] T239728: Re-import OSM data at eqiad and codfw to temporarily fix current OSM replication issues. 
- https://phabricator.wikimedia.org/T239728 [19:06:55] (03PS4) 10Jdlrobson: Add minerva custom logo for la.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557439 (https://phabricator.wikimedia.org/T240728) (owner: 10Ammarpad) [19:07:49] (03CR) 10Jdlrobson: "Thanks as always Ammar!" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557439 (https://phabricator.wikimedia.org/T240728) (owner: 10Ammarpad) [19:09:34] (03PS3) 10Urbanecm: Remove Wiktionary and Wikiquote from $wgInterlanguageLinkCodeMap for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558053 (https://phabricator.wikimedia.org/T174160) (owner: 10Jon Harald Søby) [19:09:41] hmm i think CI is lazy and no, merge conflict :/ [19:09:44] sorry for the delay Jhs [19:09:49] (03CR) 10Urbanecm: [C: 03+2] Remove Wiktionary and Wikiquote from $wgInterlanguageLinkCodeMap for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558053 (https://phabricator.wikimedia.org/T174160) (owner: 10Jon Harald Søby) [19:09:50] no worries [19:10:07] okay, now the jobs finally runs [19:10:30] <_joe_> Urbanecm: oh? [19:10:35] <_joe_> oh you mean CI [19:10:41] _joe_: ah, sorry, I meant CI [19:10:47] (03Merged) 10jenkins-bot: Remove Wiktionary and Wikiquote from $wgInterlanguageLinkCodeMap for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558053 (https://phabricator.wikimedia.org/T174160) (owner: 10Jon Harald Søby) [19:10:48] <_joe_> ahah :P [19:11:14] Jhs: could you verify at mwdebug1001, please? 
[19:11:21] Urbanecm, sure, checking [19:12:31] Urbanecm, yup, looks as it should 👍 [19:12:37] thx, syncing [19:13:03] Urbanecm, and sorry about the double-work, but at least now jeblad should be happy :) [19:13:13] np Jhs [19:13:57] (03PS3) 10Urbanecm: Add namespace aliases for zhwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556847 (https://phabricator.wikimedia.org/T240428) (owner: 10Ammarpad) [19:14:05] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 4541c4a: Remove Wiktionary and Wikiquote from $wgInterlanguageLinkCodeMap for now (T174160) (duration: 00m 57s) [19:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:11] T174160: Language code(s) for nowiki should be changed to nb - https://phabricator.wikimedia.org/T174160 [19:14:15] (03CR) 10Urbanecm: [C: 03+2] Add namespace aliases for zhwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556847 (https://phabricator.wikimedia.org/T240428) (owner: 10Ammarpad) [19:14:31] Jhs: here you are [19:15:06] (03Merged) 10jenkins-bot: Add namespace aliases for zhwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556847 (https://phabricator.wikimedia.org/T240428) (owner: 10Ammarpad) [19:15:10] Urbanecm, great. Alright, putting some kids to bed now.
ttyl [19:15:17] ttyl Jhs [19:16:51] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: ced7842: Add namespace aliases for zhwikiquote (T240428) (duration: 00m 56s) [19:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:56] T240428: Create four namespace aliases for zhwikiquote - https://phabricator.wikimedia.org/T240428 [19:17:23] !log mwscript namespaceDupes.php --wiki=zhwikiquote --fix (T240428) [19:17:25] !log mobrovac@deploy1001 Started deploy [cpjobqueue/deploy@0047875]: Do not consume the fetchGoogleCloudVisionAnnotations topic -- T240518 [19:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:34] T240518: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 [19:18:25] !log mobrovac@deploy1001 Finished deploy [cpjobqueue/deploy@0047875]: Do not consume the fetchGoogleCloudVisionAnnotations topic -- T240518 (duration: 01m 00s) [19:18:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:03] (03PS2) 10Urbanecm: Add additional import sources for zhwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557789 (https://phabricator.wikimedia.org/T240814) (owner: 10Ammarpad) [19:19:05] (03Abandoned) 10C. Scott Ananian: Ensure fatal PHP7 errors from the Parsoid cluster don't spam mediawiki logs [puppet] - 10https://gerrit.wikimedia.org/r/554638 (https://phabricator.wikimedia.org/T239867) (owner: 10C. 
Scott Ananian) [19:19:08] (03CR) 10Urbanecm: [C: 03+2] Add additional import sources for zhwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557789 (https://phabricator.wikimedia.org/T240814) (owner: 10Ammarpad) [19:20:02] (03Merged) 10jenkins-bot: Add additional import sources for zhwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557789 (https://phabricator.wikimedia.org/T240814) (owner: 10Ammarpad) [19:21:42] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 6e518be: Add additional import sources for zhwikisource (T240814) (duration: 00m 56s) [19:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:47] T240814: Add import sources for zhwikisource - https://phabricator.wikimedia.org/T240814 [19:23:06] matthiasmullie: go ahead with your patch, please let me know once you're done, so i can continue :-) [19:23:17] ok cool - shouldn't take long [19:23:32] ok [19:25:57] (03PS1) 10Elukey: Add analytics-privatedata keytab to stat/notebook hosts [puppet] - 10https://gerrit.wikimedia.org/r/558174 [19:26:37] (03CR) 10Elukey: [C: 03+2] Add analytics-privatedata keytab to stat/notebook hosts [puppet] - 10https://gerrit.wikimedia.org/r/558174 (owner: 10Elukey) [19:27:11] !log mlitn@deploy1001 Synchronized php-1.35.0-wmf.10/extensions/WikibaseMediaInfo: Override getSitelink in mediainfo table, instead of removing it (duration: 00m 56s) [19:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:57] Urbanecm: done [19:28:01] thanks matthiasmullie [19:31:44] !log mobrovac@deploy1001 Started deploy [cpjobqueue/deploy@1efbc29]: Increase concurrency for low traffic jobs -- T240518 [19:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:50] T240518: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 [19:32:30] !log mobrovac@deploy1001 Finished deploy 
[cpjobqueue/deploy@1efbc29]: Increase concurrency for low traffic jobs -- T240518 (duration: 00m 46s) [19:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:15] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert, rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [19:38:43] (03PS5) 10Urbanecm: Use editautoreviewprotected for autoreview protection level only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529039 (https://phabricator.wikimedia.org/T230103) [19:43:45] (03PS6) 10Urbanecm: Use editautoreviewprotected for autoreview protection level only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529039 (https://phabricator.wikimedia.org/T230103) [19:44:19] (03CR) 10Urbanecm: [C: 03+2] Use editautoreviewprotected for autoreview protection level only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529039 (https://phabricator.wikimedia.org/T230103) (owner: 10Urbanecm) [19:45:10] (03Merged) 10jenkins-bot: Use editautoreviewprotected for autoreview protection level only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529039 (https://phabricator.wikimedia.org/T230103) (owner: 10Urbanecm) [19:47:05] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. 
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [19:47:54] (03PS5) 10Herron: ganeti: assign ganeti400[123] role::ganeti [puppet] - 10https://gerrit.wikimedia.org/r/555761 (https://phabricator.wikimedia.org/T226444) [19:48:57] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: f4cd6d0: Use editautoreviewprotected for autoreview protection level only (T230103) (duration: 00m 57s) [19:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:03] T230103: Systematize wgRestrictionLevels - https://phabricator.wikimedia.org/T230103 [19:51:46] (03PS1) 10Herron: add dummy ulsfo ganeti RAPI key to pacify PCC [labs/private] - 10https://gerrit.wikimedia.org/r/558185 (https://phabricator.wikimedia.org/T226444) [19:52:33] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10colewhite) [19:53:42] !log mwscript renameRestrictions.php --wiki=arwiki 'autoreview' 'editautoreviewprotected' (T230103) [19:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:03] !log mwscript renameRestrictions.php --wiki=dewiktionary 'autoreviewprotected' 'editautoreviewprotected' (T230103) [19:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:08] T230103: Systematize wgRestrictionLevels - https://phabricator.wikimedia.org/T230103 [19:54:15] !log mobrovac@deploy1001 Started deploy [cpjobqueue/deploy@9423e7e]: Increase concurrency for low traffic jobs even further -- T240518 [19:54:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:20] T240518: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 [19:54:56] (03CR) 10Herron: [V: 03+2 C: 03+2] add dummy ulsfo ganeti RAPI key to pacify PCC [labs/private] - 
10https://gerrit.wikimedia.org/r/558185 (https://phabricator.wikimedia.org/T226444) (owner: 10Herron) [19:55:04] !log mobrovac@deploy1001 Finished deploy [cpjobqueue/deploy@9423e7e]: Increase concurrency for low traffic jobs even further -- T240518 (duration: 00m 49s) [19:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:54] !log mwscript renameRestrictions.php --wiki=ptwiki 'autoreviewer' 'editautoreviewprotected' (T230103) [19:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:39] (03PS1) 10Urbanecm: Remove custom protection level for ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558190 [20:01:11] (03CR) 10Urbanecm: [C: 03+2] Remove custom protection level for ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558190 (owner: 10Urbanecm) [20:02:03] (03Merged) 10jenkins-bot: Remove custom protection level for ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558190 (owner: 10Urbanecm) [20:03:32] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 9d7530e: Remove custom protection level for ptwikinews (duration: 00m 57s) [20:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:39] !log Morning SWAT done [20:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:43] !log mholloway-shell@deploy1001 Synchronized php-1.35.0-wmf.10/extensions/MachineVision: Fix: Restore suggestion randomization (duration: 01m 00s) [20:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:56] 10Operations, 10WMF-JobQueue, 10Core Platform Team Workboards (Clinic Duty Team), 10MW-1.35-notes (1.35.0-wmf.10; 2019-12-10), 10Patch-For-Review: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (10mobrovac) p:05Unbreak!→03High Lowering the priorit... 
[20:12:40] (03PS1) 10Halfak: Fixes url for wikilabels repo. [puppet] - 10https://gerrit.wikimedia.org/r/558196 [20:13:10] (03PS2) 10Halfak: Fixes url for wikilabels repo. [puppet] - 10https://gerrit.wikimedia.org/r/558196 (https://phabricator.wikimedia.org/T236546) [20:17:11] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=205 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:18:03] !log restart php on mw1326 [20:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:59] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:25:55] !log restart php on mw1330 [20:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:31] 10Operations, 10WMF-JobQueue, 10Core Platform Team Workboards (Clinic Duty Team), 10MW-1.35-notes (1.35.0-wmf.10; 2019-12-10), 10Patch-For-Review: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (10Mholloway) For reference, the relevant Job class is ht... 
[20:29:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1084 after schema change', diff saved to https://phabricator.wikimedia.org/P9883 and previous config saved to /var/cache/conftool/dbconfig/20191216-202902-marostegui.json [20:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1081 schema change', diff saved to https://phabricator.wikimedia.org/P9884 and previous config saved to /var/cache/conftool/dbconfig/20191216-203202-marostegui.json [20:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:35] (03CR) 10BryanDavis: [C: 03+1] "Verified via cherry-pick to a project puppetmaster in the mediawiki-vagrant project" [puppet] - 10https://gerrit.wikimedia.org/r/558136 (https://phabricator.wikimedia.org/T240875) (owner: 10BryanDavis) [20:34:32] 10Operations, 10Machine vision, 10Product-Infrastructure-Team-Backlog, 10Structured-Data-Backlog, and 5 others: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (10Mholloway) [20:36:28] (03PS6) 10Herron: ganeti: assign ganeti400[123] role::ganeti [puppet] - 10https://gerrit.wikimedia.org/r/555761 (https://phabricator.wikimedia.org/T226444) [20:40:07] (03CR) 10Herron: [C: 03+2] ganeti: assign ganeti400[123] role::ganeti [puppet] - 10https://gerrit.wikimedia.org/r/555761 (https://phabricator.wikimedia.org/T226444) (owner: 10Herron) [20:45:41] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@4e72559]: Update mobileapps to 9118b44 [20:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:47] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@4e72559]: Update mobileapps to 9118b44 (duration: 07m 06s) [20:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:04] cscott, arlolra, subbu, halfak, and accraze: How many deployers does it 
take to do Services – Graphoid / Parsoid / Citoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191216T2100). [21:02:47] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10colewhite) [21:03:27] (03PS1) 10Herron: ganeti: use 'drbd-utils' package on buster and beyond [puppet] - 10https://gerrit.wikimedia.org/r/558214 (https://phabricator.wikimedia.org/T226444) [21:10:01] (03PS2) 10Herron: ganeti: use 'drbd-utils' package on buster and beyond [puppet] - 10https://gerrit.wikimedia.org/r/558214 (https://phabricator.wikimedia.org/T226444) [21:11:27] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1003/20008/" [puppet] - 10https://gerrit.wikimedia.org/r/558214 (https://phabricator.wikimedia.org/T226444) (owner: 10Herron) [21:20:35] PROBLEM - parsoid on scandium is CRITICAL: connect to address 10.64.48.94 and port 8142: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid [21:22:19] (03CR) 10Jhedden: [C: 03+2] cloud: add ::apparmor dependency to ::profile::wmcs::mediawiki_vagrant [puppet] - 10https://gerrit.wikimedia.org/r/558136 (https://phabricator.wikimedia.org/T240875) (owner: 10BryanDavis) [21:27:49] RECOVERY - parsoid on scandium is OK: HTTP OK: HTTP/1.1 200 OK - 1535 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid [21:28:03] PROBLEM - Check correctness of the icinga configuration on icinga1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [21:34:05] (03PS1) 10BryanDavis: toolforge: Add CORS header to docker registry [puppet] - 10https://gerrit.wikimedia.org/r/558220 (https://phabricator.wikimedia.org/T232135) [21:37:03] (03PS5) 10Ssingh: Add script for fetching routing information from RIPEstat [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/556732 [21:52:40] 
10Operations, 10Mail: CA App Synthetic Monitor Mail (SMTP): Connection timed out; connect(): -2 - https://phabricator.wikimedia.org/T240906 (10herron) p:05Triage→03Normal [21:53:29] !log taking over mwdebug2001 to do some testing [21:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:51] !log arlolra@deploy1001 Started deploy [parsoid/deploy@a42ca13]: Updating Parsoid to 56a64ef [21:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:59] 10Operations, 10Mail: CA App Synthetic Monitor Mail (SMTP): Connection timed out; connect(): -2 - https://phabricator.wikimedia.org/T240906 (10herron) Looked into these alerts a bit, and pulled the source IP addresses for these checks from watchmouse, but I don't see these IPs appearing in the mx logs. I thin... [21:59:07] 10Operations, 10Release-Engineering-Team, 10serviceops, 10Performance-Team (Radar), and 2 others: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp on mwdebug1002) - https://phabricator.wikimedia.org/T214734 (10Urbanecm) Seems to happen w/ mwdebug1001 too? https://l... [21:59:46] Urbanecm: yes, ef.fie pointed that out at the start of her last post :) [22:00:04] Reedy and sbassett: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191216T2200). [22:00:29] ah, thanks cdanis - was confused by the title which still says mwdebug1002 [22:01:00] 10Operations, 10Release-Engineering-Team, 10serviceops, 10Performance-Team (Radar), and 2 others: All debug hosts give (likely spurious) message: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp) - https://phabricator.wikimedia.org/T214734 (10CDanis) [22:01:02] good point :) [22:01:02] * Urbanecm wonders if that still means we should avoid using that host? 
[22:01:05] no [22:01:10] and I am going to take down that motd [22:01:29] ok. maybe an info to ops@ would be a good idea too [22:02:07] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@a42ca13]: Updating Parsoid to 56a64ef (duration: 08m 16s) [22:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:35] (03PS1) 10Herron: mx: log smtp_connection details [puppet] - 10https://gerrit.wikimedia.org/r/558228 (https://phabricator.wikimedia.org/T240906) [22:05:08] (03CR) 10Herron: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/20009/" [puppet] - 10https://gerrit.wikimedia.org/r/558228 (https://phabricator.wikimedia.org/T240906) (owner: 10Herron) [22:07:01] !log increasing mx exim log verbosity by adding smtp_connection to log_selector list T240906 [22:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:10] T240906: CA App Synthetic Monitor Mail (SMTP): Connection timed out; connect(): -2 - https://phabricator.wikimedia.org/T240906 [22:17:58] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10colewhite) [22:23:54] (03PS1) 10BBlack: dotls: ssl tweaks [puppet] - 10https://gerrit.wikimedia.org/r/558234 (https://phabricator.wikimedia.org/T239994) [22:23:56] (03PS1) 10BBlack: dotls: enable on all servers [puppet] - 10https://gerrit.wikimedia.org/r/558235 (https://phabricator.wikimedia.org/T239994) [22:24:46] !log ✔️ cdanis@mwdebug2001.codfw.wmnet /srv/mediawiki 🕔🍺 scap pull [22:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:26] (03PS2) 10BBlack: dotls: ssl tweaks [puppet] - 10https://gerrit.wikimedia.org/r/558234 (https://phabricator.wikimedia.org/T239994) [22:27:28] (03PS2) 10BBlack: dotls: enable on all servers [puppet] - 10https://gerrit.wikimedia.org/r/558235 (https://phabricator.wikimedia.org/T239994) [22:29:24] (03CR) 10BBlack: [C: 
03+2] dotls: ssl tweaks [puppet] - 10https://gerrit.wikimedia.org/r/558234 (https://phabricator.wikimedia.org/T239994) (owner: 10BBlack) [22:31:54] (03PS1) 10Bstorm: toolforge-k8s: fix the formatting in the kubeadm init config [puppet] - 10https://gerrit.wikimedia.org/r/558238 (https://phabricator.wikimedia.org/T240009) [22:31:57] (03CR) 10BBlack: [C: 03+2] dotls: enable on all servers [puppet] - 10https://gerrit.wikimedia.org/r/558235 (https://phabricator.wikimedia.org/T239994) (owner: 10BBlack) [22:35:43] (03PS2) 10Bstorm: toolforge-k8s: fix the formatting in the kubeadm init config [puppet] - 10https://gerrit.wikimedia.org/r/558238 (https://phabricator.wikimedia.org/T240009) [22:36:47] (03CR) 10Bstorm: [C: 03+2] toolforge-k8s: fix the formatting in the kubeadm init config [puppet] - 10https://gerrit.wikimedia.org/r/558238 (https://phabricator.wikimedia.org/T240009) (owner: 10Bstorm) [22:37:52] cdanis: love the emojis :P [22:40:26] !log arlolra@deploy1001 Started deploy [parsoid/deploy@26ee446]: Updating Parsoid to 8ccc085 [22:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:43] (03PS1) 10Ladsgroup: Add a bit for forcing LC caching backend in cli mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558239 (https://phabricator.wikimedia.org/T105683) [22:40:48] thanks SPF|Cloud, they're my shell prompt :) [22:46:41] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.01138 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [22:47:20] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@26ee446]: Updating Parsoid to 8ccc085 (duration: 06m 54s) [22:47:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:33] (03PS1) 10Bstorm: toolforge-k8s: typo in encryption config [puppet] - 10https://gerrit.wikimedia.org/r/558241 [22:49:46] I think I may have caused that "widespread puppet agent failures", and 
I think it was a temporary bump [22:50:30] (03CR) 10Bstorm: [C: 03+2] toolforge-k8s: typo in encryption config [puppet] - 10https://gerrit.wikimedia.org/r/558241 (owner: 10Bstorm) [22:56:04] (03CR) 10BryanDavis: "* registry1001.eqiad.wmnet: noop -- https://puppet-compiler.wmflabs.org/compiler1002/20010/" [puppet] - 10https://gerrit.wikimedia.org/r/558220 (https://phabricator.wikimedia.org/T232135) (owner: 10BryanDavis) [22:57:32] !log Updated Parsoid to 8ccc085 (T240091, T236912, T236415, T239929, T214649, T239830) [22:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:44] T240091: "PHP Notice: Undefined offset: 2" in WikitextEscapeHandlers - https://phabricator.wikimedia.org/T240091 [22:57:44] T236912: Return value of Parsoid\Wt2Html\PP\Processors\WrapSections::getDSR() must be of the type integer, null returned - https://phabricator.wikimedia.org/T236912 [22:57:44] T236415: Edge case category link output difference - https://phabricator.wikimedia.org/T236415 [22:57:44] T214649: VE's gallery representation differs enough so that selser is never applied? - https://phabricator.wikimedia.org/T214649 [22:57:45] T239929: Class 'Parsoid\Language\LanguageKk' not found - https://phabricator.wikimedia.org/T239929 [22:57:45] T239830: Add metrics for startup time for language variant code - https://phabricator.wikimedia.org/T239830 [22:57:54] (03PS2) 10BryanDavis: toolforge: Add CORS header to docker registry [puppet] - 10https://gerrit.wikimedia.org/r/558220 (https://phabricator.wikimedia.org/T232135) [22:59:40] (03CR) 10BryanDavis: "> Uploaded patch set 2." 
[puppet] - 10https://gerrit.wikimedia.org/r/558220 (https://phabricator.wikimedia.org/T232135) (owner: 10BryanDavis) [23:01:07] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.002134 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [23:02:57] "widespread puppet agent failures" is the name of my next ska band [23:04:20] <_joe_> that sounds more like the name for a death metal band [23:05:50] I think icinga config might be broken again [23:05:50] (03PS3) 10BryanDavis: cloud: update maintain-views to handle dblists with comments [puppet] - 10https://gerrit.wikimedia.org/r/555740 (https://phabricator.wikimedia.org/T239415) [23:06:05] (03CR) 10BryanDavis: cloud: update maintain-views to handle dblists with comments (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/555740 (https://phabricator.wikimedia.org/T239415) (owner: 10BryanDavis) [23:06:06] checking it out [23:06:58] 10Operations, 10Machine vision, 10Product-Infrastructure-Team-Backlog, 10Structured-Data-Backlog, and 5 others: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (10Joe) Thank you @mobrovac for the help! [23:08:12] Error: Could not find any hostgroup matching 'ganeti_ulsfo' (config file '/etc/icinga/objects/puppet_hosts.cfg', starting on line 19178) [23:11:40] (03PS1) 10BBlack: Add cluster defs for edge ganetis [puppet] - 10https://gerrit.wikimedia.org/r/558247 (https://phabricator.wikimedia.org/T226444) [23:12:26] (03CR) 10BBlack: [C: 03+2] Add cluster defs for edge ganetis [puppet] - 10https://gerrit.wikimedia.org/r/558247 (https://phabricator.wikimedia.org/T226444) (owner: 10BBlack) [23:20:31] (03PS1) 10Bstorm: toolforge-k8s: a very small, critical typo in the encryption config [puppet] - 10https://gerrit.wikimedia.org/r/558251 [23:21:05] (03CR) 10Bstorm: [C: 03+1] "Looks like it might do the thing." 
[puppet] - 10https://gerrit.wikimedia.org/r/558220 (https://phabricator.wikimedia.org/T232135) (owner: 10BryanDavis) [23:22:13] (03CR) 10Bstorm: [C: 03+2] toolforge-k8s: a very small, critical typo in the encryption config [puppet] - 10https://gerrit.wikimedia.org/r/558251 (owner: 10Bstorm) [23:26:55] (fixed) [23:29:13] RECOVERY - Check correctness of the icinga configuration on icinga1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [23:34:12] 10Operations, 10Traffic: Implement DNS-over-TLS for AuthDNS - https://phabricator.wikimedia.org/T239994 (10BBlack) Actually we can't realistically do global monitoring from icinga either, because icinga isn't on Buster and so it doesn't have the right library/tool access to check a TLSv1.3-only service, so we'... [23:37:58] 10Operations, 10Traffic: Implement DNS-over-TLS for AuthDNS - https://phabricator.wikimedia.org/T239994 (10BBlack) External queries now working (note they all return a codfw IP without edns-client-subnet in play, because codfw is closest to my laptop and PROXYv2 is working for sending the "real" client IP fro... [23:38:48] 10Operations, 10Traffic: Implement DNS-over-TLS for AuthDNS - https://phabricator.wikimedia.org/T239994 (10BBlack) 05Open→03Resolved a:03BBlack [23:42:30] 10Operations, 10ops-esams, 10DC-Ops: Add missing labels for equipment and cables - https://phabricator.wikimedia.org/T237009 (10RobH) >>! In T237009#5693855, @mark wrote: > cr2-esams now has its power cables labeled: > > PEM0: 20145 (to ps1 outlet 32) > PEM1: 20146 (to ps2 outlet 32) > PEM2: 20147 (to ps1 o... [23:53:24] 10Operations, 10ops-esams, 10DC-Ops: Add missing labels for equipment and cables - https://phabricator.wikimedia.org/T237009 (10RobH) >>! In T237009#5626952, @Papaul wrote: > Power cords label information for servers in rack OE14 > |Server| |ps1| | ps2| > |lvs3005| |20080| |20088| > |ganet...