[00:02:14] (03PS1) 10CRusnov: netbox: Use netbox-extras version of ganeti-sync [puppet] - 10https://gerrit.wikimedia.org/r/594602 [00:04:07] (03PS2) 10CRusnov: netbox: Use netbox-extras version of tools [puppet] - 10https://gerrit.wikimedia.org/r/594602 [00:04:24] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/594602 (owner: 10CRusnov) [00:05:49] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [00:08:05] RECOVERY - Check the last execution of netbox_ganeti_eqiad_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [00:08:23] RECOVERY - Check the last execution of netbox_ganeti_eqsin_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [00:24:26] (03CR) 10Krinkle: "@Hashar I was mailed by WMDE and by a volunteer." 
[puppet] - 10https://gerrit.wikimedia.org/r/593344 (owner: 10Krinkle) [00:34:55] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:34:57] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_eventgate_main_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:35:05] PROBLEM - eventgate-main LVS eqiad on eventgate-main.svc.eqiad.wmnet is CRITICAL: / (root with no query params) is CRITICAL: Test root with no query params returned the unexpected status 503 (expecting: 404): / (doc from root) is CRITICAL: Test doc from root returned the unexpected status 503 (expecting: 200): / (root with wrong query param) is CRITICAL: Test root with wrong query param returned the unexpected status 503 (expecti [00:35:05] wikitech.wikimedia.org/wiki/Event_Platform/EventGate [00:36:47] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:36:55] RECOVERY - eventgate-main LVS eqiad on eventgate-main.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate [00:38:37] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:48:35] PROBLEM - dump of m3 in eqiad on db1115 is CRITICAL: dump for m3 at eqiad taken more than 8 days ago: Most recent backup 2020-04-28 00:44:31 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [00:50:43] PROBLEM - BGP status on cr3-knams is CRITICAL: BGP CRITICAL - AS1257/IPv4: Active - Tele2, AS1257/IPv6: Active - Tele2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [01:18:07] RECOVERY - BGP status on cr3-knams is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:50:55] PROBLEM - dump of x1 in codfw on db1115 is CRITICAL: dump for x1 at codfw taken more than 8 days ago: Most recent backup 2020-04-28 02:23:34 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [04:02:57] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T251122 (10Nuria) @ahemmer Are you planning to access hadoop for data? it seems that what you need is rather access to https://turnilo.... [04:05:06] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T251122 (10Nuria) @colewhite I see access patches to hadoop private data are merged but let's hear from @ahemmer . I do not think he ne... [04:35:13] RECOVERY - Host analytics1060 is UP: PING WARNING - Packet loss = 90%, RTA = 491.11 ms [04:37:25] 10Operations, 10observability: check_mw_versions alerts for each individual app server - https://phabricator.wikimedia.org/T251942 (10Joe) While it's clear that 400 alerts flooding production are not great, this check is important for each single machine. 
So we can aggregate the output, but we can't suppress i... [04:39:26] 10Operations, 10serviceops: Chaos Engineering - Stop for x hours one or more mc10xx memcached shards - https://phabricator.wikimedia.org/T251378 (10Joe) @elukey let's schedule this test for 6:00Z on monday, May 11th? [04:39:39] PROBLEM - SSH on analytics1060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:43:15] PROBLEM - Host analytics1060 is DOWN: PING CRITICAL - Packet loss = 100% [04:44:17] ^checking [04:50:20] 10Operations, 10Analytics: Analytics1060 unresponsive - https://phabricator.wikimedia.org/T251973 (10Marostegui) [04:50:22] elukey: created this ^ [04:57:27] (03PS1) 10Marostegui: install_server: Allow reimage db114[1-9] [puppet] - 10https://gerrit.wikimedia.org/r/594616 (https://phabricator.wikimedia.org/T251614) [05:02:11] !log Deploy schema change on db1121 [05:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:03:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1103:3314 in vslow on s4 while db1121 is out T250055', diff saved to https://phabricator.wikimedia.org/P11157 and previous config saved to /var/cache/conftool/dbconfig/20200506-050340-marostegui.json [05:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:03:44] T250055: Remove image.img_deleted column from production - https://phabricator.wikimedia.org/T250055 [05:35:10] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] contint: Add redirect from stale doc dirs to current ones [puppet] - 10https://gerrit.wikimedia.org/r/593344 (owner: 10Krinkle) [05:59:38] marostegui: <3 [06:00:01] !log powercycle analytics1060 - host stuck - T251973 [06:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:04] T251973: Analytics1060 unresponsive - https://phabricator.wikimedia.org/T251973 [06:01:07] sometimes those lock ups happen [06:03:51] RECOVERY - SSH on analytics1060 
is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:03:53] RECOVERY - Host analytics1060 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [06:50:00] elukey: is that something usual? [06:51:17] (03PS1) 10KartikMistry: Enable ContentTranslation in Armenian Wikipedia as a default tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594668 (https://phabricator.wikimedia.org/T249229) [07:06:51] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/594516 (owner: 10Elukey) [07:10:58] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. It's worth checking if other analytics services also collect user-specific settings in /tmp, /run/user is the ideal place for " [puppet] - 10https://gerrit.wikimedia.org/r/594519 (owner: 10Elukey) [07:17:08] (03CR) 10Muehlenhoff: [C: 04-1] "The packaging looks fine, I only lefts some smaller nits. But there's the issue that the current Anaconda includes proprietary binaries, w" (037 comments) [debs/anaconda] (debian) - 10https://gerrit.wikimedia.org/r/594204 (https://phabricator.wikimedia.org/T251006) (owner: 10Ottomata) [07:20:20] (03CR) 10Muehlenhoff: admin: add ahemmer to analytics-privatedata-users and researchers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/594593 (https://phabricator.wikimedia.org/T251122) (owner: 10Cwhite) [07:22:12] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP Access for Superset for PDas - https://phabricator.wikimedia.org/T251516 (10MoritzMuehlenhoff) >>! In T251516#6111024, @colewhite wrote: > Hi Praveen! > > I appears you are a WMF employee, so this is likely a request for membership to the `wmf`... [07:22:33] marostegui: can you give https://phabricator.wikimedia.org/T251980 a look? 
[07:22:49] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/594598 (https://phabricator.wikimedia.org/T251516) (owner: 10Cwhite) [07:22:55] (03CR) 10Dzahn: [C: 03+1] "looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/594598 (https://phabricator.wikimedia.org/T251516) (owner: 10Cwhite) [07:26:10] (03CR) 10Ema: [C: 03+2] vtc test box: install py3 version of jenkinsapi [puppet] - 10https://gerrit.wikimedia.org/r/594411 (owner: 10Ema) [07:27:29] (03CR) 10Dzahn: [C: 03+2] add static-codereview.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/594501 (https://phabricator.wikimedia.org/T243056) (owner: 10Dzahn) [07:27:44] (03PS2) 10Dzahn: add static-codereview.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/594501 (https://phabricator.wikimedia.org/T243056) [07:32:06] !log downgrade to ATS 8.0.7-1wm3 on cp4026, cp4031, cp5006 and cp5011 [07:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:10] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [07:32:59] 10Operations, 10DBA: Disable/remove unused features on Tendril - https://phabricator.wikimedia.org/T231185 (10jcrespo) [07:33:02] 10Operations: Migrate dbmonitor hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224589 (10jcrespo) [07:33:07] 10Operations, 10observability, 10Goal, 10Patch-For-Review, 10Sustainability: Setup bacula backup monitoring - https://phabricator.wikimedia.org/T234900 (10jcrespo) [07:33:11] 10Operations, 10DBA, 10observability: Display lag on grafana (prometheus) and dbtree from pt-heartbeat instead (or in addition) of Seconds_Behind_Master - https://phabricator.wikimedia.org/T141968 (10jcrespo) [07:35:46] 10Operations, 10DBA, 10Traffic, 10Patch-For-Review: Framework to transfer files over the LAN - https://phabricator.wikimedia.org/T156462 (10jcrespo) [07:38:56] !log 
kormat@cumin1001 dbctl commit (dc=all): 'Set es1023 (es5 master) to 0 weight after reimaging es1024 T250666', diff saved to https://phabricator.wikimedia.org/P11158 and previous config saved to /var/cache/conftool/dbconfig/20200506-073856-kormat.json [07:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:59] T250666: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 [07:43:34] (03CR) 10Kormat: [C: 04-1] "One minor comment." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/594616 (https://phabricator.wikimedia.org/T251614) (owner: 10Marostegui) [07:45:11] (03PS2) 10Kormat: install_server: switch d-i-test to buster [puppet] - 10https://gerrit.wikimedia.org/r/594494 (https://phabricator.wikimedia.org/T251768) [07:45:59] (03CR) 10Kormat: [C: 03+2] install_server: switch d-i-test to buster [puppet] - 10https://gerrit.wikimedia.org/r/594494 (https://phabricator.wikimedia.org/T251768) (owner: 10Kormat) [07:46:13] (03CR) 10Marostegui: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/594616 (https://phabricator.wikimedia.org/T251614) (owner: 10Marostegui) [07:48:37] (03CR) 10Kormat: [C: 03+1] "> The reason I don't put it that way is because hosts will be reimaged at different times (as they normally get racked at different times)" [puppet] - 10https://gerrit.wikimedia.org/r/594616 (https://phabricator.wikimedia.org/T251614) (owner: 10Marostegui) [07:49:26] (03PS3) 10Ema: Add flag -frontend_delay [software/purged] - 10https://gerrit.wikimedia.org/r/594454 (https://phabricator.wikimedia.org/T249583) [07:51:24] (03PS1) 10Dzahn: add profile to setup static-codereview.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/594669 (https://phabricator.wikimedia.org/T243056) [07:51:35] (03PS2) 10Marostegui: install_server: Allow reimage db114[1-9] [puppet] - 10https://gerrit.wikimedia.org/r/594616 (https://phabricator.wikimedia.org/T251614) [07:52:08] (03CR) 10Ema: Add 
flag -frontend_delay (033 comments) [software/purged] - 10https://gerrit.wikimedia.org/r/594454 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [07:52:23] (03CR) 10Marostegui: [C: 03+2] install_server: Allow reimage db114[1-9] [puppet] - 10https://gerrit.wikimedia.org/r/594616 (https://phabricator.wikimedia.org/T251614) (owner: 10Marostegui) [07:56:00] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 31st May) rack/setup/install db114[1-9] - https://phabricator.wikimedia.org/T251614 (10Marostegui) [07:56:29] (03PS2) 10Ema: cache: remove profile::cache::base::use_purged [puppet] - 10https://gerrit.wikimedia.org/r/593193 (https://phabricator.wikimedia.org/T251374) [08:00:15] 10Operations, 10MediaWiki-extensions-CodeReview, 10Patch-For-Review: Set up static-codereview.wikimedia.org to host static HTML dump of CodeReview - https://phabricator.wikimedia.org/T243056 (10Dzahn) @Legoktm I added the new name to DNS and started with the puppet patch to create the httpd config. Some que... [08:00:59] (03CR) 10Dzahn: "WIP. some questions at https://phabricator.wikimedia.org/T243056#6111775" [puppet] - 10https://gerrit.wikimedia.org/r/594669 (https://phabricator.wikimedia.org/T243056) (owner: 10Dzahn) [08:01:36] 10Operations, 10MediaWiki-extensions-CodeReview, 10Patch-For-Review: Set up static-codereview.wikimedia.org to host static HTML dump of CodeReview - https://phabricator.wikimedia.org/T243056 (10Dzahn) a:03Dzahn [08:01:45] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 31st May) rack/setup/install db114[1-9] - https://phabricator.wikimedia.org/T251614 (10Marostegui) Puppet done to get them as spare. As I said, I can do the install myself, what is pending from DCOPs side (apart from racking and ca... 
[08:02:46] <_joe_> !log restarted php-fpm with tweaked parameters on mw1407, now briefly pooling for traffic (T99740) [08:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:49] T99740: Use static php array files for l10n cache at WMF (instead of CDB) - https://phabricator.wikimedia.org/T99740 [08:02:59] (03CR) 10Ema: [C: 03+2] cache: remove profile::cache::base::use_purged [puppet] - 10https://gerrit.wikimedia.org/r/593193 (https://phabricator.wikimedia.org/T251374) (owner: 10Ema) [08:04:22] (03CR) 10Dzahn: "arr, almost duplicate of https://gerrit.wikimedia.org/r/c/operations/puppet/+/567407 that i forgot already exists" [puppet] - 10https://gerrit.wikimedia.org/r/594669 (https://phabricator.wikimedia.org/T243056) (owner: 10Dzahn) [08:14:28] <_joe_> jouncebot: next [08:14:28] In 2 hour(s) and 45 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200506T1100) [08:18:20] (03PS1) 10Vgutierrez: ATS: Add the client TTFB to ats-tls log [puppet] - 10https://gerrit.wikimedia.org/r/594670 (https://phabricator.wikimedia.org/T244538) [08:19:07] (03PS2) 10Ema: varnish: remove varnish::htcppurger [puppet] - 10https://gerrit.wikimedia.org/r/593194 (https://phabricator.wikimedia.org/T251374) [08:20:38] (03CR) 10Ema: [C: 03+2] "pcc noop https://puppet-compiler.wmflabs.org/compiler1001/22324/" [puppet] - 10https://gerrit.wikimedia.org/r/593194 (https://phabricator.wikimedia.org/T251374) (owner: 10Ema) [08:24:18] 10Operations, 10serviceops, 10Continuous-Integration-Config: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10MoritzMuehlenhoff) >>! In T251918#6109976, @hashar wrote: > Do you have details as to why the releng/bazel:0.4.0 fails? No, what's I opened a task so that... 
[08:29:22] (03PS3) 10Giuseppe Lavagetto: Revert "Enable LCStoreStaticArray on depooled mw1407 for benchmarking" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592867 (https://phabricator.wikimedia.org/T99740) [08:29:34] (03PS4) 10Giuseppe Lavagetto: Revert "Enable LCStoreStaticArray on depooled mw1407 for benchmarking" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592867 (https://phabricator.wikimedia.org/T99740) [08:30:42] PROBLEM - PHP opcache health on mw1407 is CRITICAL: CRITICAL: opcache full. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [08:32:40] (03PS4) 10Dzahn: Add profile and module for for static HTML dump of CodeReview [puppet] - 10https://gerrit.wikimedia.org/r/567407 (https://phabricator.wikimedia.org/T243056) (owner: 10Legoktm) [08:33:27] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Revert "Enable LCStoreStaticArray on depooled mw1407 for benchmarking" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592867 (https://phabricator.wikimedia.org/T99740) (owner: 10Giuseppe Lavagetto) [08:33:47] (03CR) 10jerkins-bot: [V: 04-1] Add profile and module for for static HTML dump of CodeReview [puppet] - 10https://gerrit.wikimedia.org/r/567407 (https://phabricator.wikimedia.org/T243056) (owner: 10Legoktm) [08:34:19] (03Merged) 10jenkins-bot: Revert "Enable LCStoreStaticArray on depooled mw1407 for benchmarking" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592867 (https://phabricator.wikimedia.org/T99740) (owner: 10Giuseppe Lavagetto) [08:34:20] PROBLEM - MariaDB Slave Lag: x1 on db2101 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 760.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [08:34:39] 10Operations, 10MediaWiki-extensions-CodeReview, 10Patch-For-Review: Set up static-codereview.wikimedia.org to host static HTML dump of CodeReview - https://phabricator.wikimedia.org/T243056 (10Dzahn) Sorry, i forgot that you had already started 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/567407 do... [08:35:11] <_joe_> don't worry about mw1407, it's depooled [08:36:24] (03PS1) 10Dzahn: Revert "icinga: replace check_ssl_http with check_ssl_http_letsencrypt" [puppet] - 10https://gerrit.wikimedia.org/r/594672 [08:37:05] ack, thanks joe [08:38:15] checking that x1 slave [08:38:33] ah, that's the backups [08:38:41] jynus: should we disable notifications for that one? [08:39:19] 10Operations, 10Traffic, 10serviceops: Certificate *.wikipedia.org valid until 2020-06-20 - https://phabricator.wikimedia.org/T251726 (10Dzahn) revert: https://gerrit.wikimedia.org/r/c/operations/puppet/+/594672 [08:39:42] RECOVERY - PHP opcache health on mw1407 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [08:43:37] marostegui: but it shouldn't lag [08:43:40] checking [08:43:45] (03PS5) 10Dzahn: Add profile and module for for static HTML dump of CodeReview [puppet] - 10https://gerrit.wikimedia.org/r/567407 (https://phabricator.wikimedia.org/T243056) (owner: 10Legoktm) [08:43:54] !log oblivian@deploy1001 Synchronized wmf-config/CommonSettings.php: Reverting change on mw1407 T99740 (duration: 01m 16s) [08:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:59] T99740: Use static php array files for l10n cache at WMF (instead of CDB) - https://phabricator.wikimedia.org/T99740 [08:45:39] 10Operations: ps1-a7-codfw - monitoring alerts - https://phabricator.wikimedia.org/T251987 (10Dzahn) [08:46:10] 10Operations: ps1-a7-codfw - monitoring alerts - https://phabricator.wikimedia.org/T251987 (10Dzahn) {F31804600} [08:46:46] 10Operations, 10ops-codfw: ps1-a7-codfw - monitoring alerts - https://phabricator.wikimedia.org/T251987 (10Dzahn) [08:47:10] I think that is an actual problem, I cannot even connect to the server [08:47:14] ACKNOWLEDGEMENT - ps1-a7-codfw-infeed-load-tower-A-phase-X on ps1-a7-codfw is CRITICAL: CRITICAL - 
Plugin timed out while executing system call daniel_zahn https://phabricator.wikimedia.org/T251987 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:47:14] ACKNOWLEDGEMENT - ps1-a7-codfw-infeed-load-tower-A-phase-Y on ps1-a7-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call daniel_zahn https://phabricator.wikimedia.org/T251987 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:47:14] ACKNOWLEDGEMENT - ps1-a7-codfw-infeed-load-tower-A-phase-Z on ps1-a7-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call daniel_zahn https://phabricator.wikimedia.org/T251987 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:47:14] ACKNOWLEDGEMENT - ps1-a7-codfw-infeed-load-tower-B-phase-X on ps1-a7-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call daniel_zahn https://phabricator.wikimedia.org/T251987 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:47:14] ACKNOWLEDGEMENT - ps1-a7-codfw-infeed-load-tower-B-phase-Y on ps1-a7-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call daniel_zahn https://phabricator.wikimedia.org/T251987 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:47:15] ACKNOWLEDGEMENT - ps1-a7-codfw-infeed-load-tower-B-phase-Z on ps1-a7-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call daniel_zahn https://phabricator.wikimedia.org/T251987 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:47:43] jynus: I can [08:47:59] I can from localhost [08:48:02] but not from cumin [08:48:04] that's weird [08:48:06] I can from cumin [08:48:08] | 667784 | root | localhost | NULL | Query | 817 | Waiting for table flush | FLUSH TABLES WITH READ LOCK | 0.000 | [08:48:08] | 668207 | dump | 10.192.16.96:56432 | NULL | Query | 1481 | Writing to net | 
SELECT /*!40001 SQL_NO_CACHE */ * FROM `wikishared`.`cx_corpora` | 0.000 | [08:48:46] yeah, I think a dump and a snapshot are happening at the same time [08:48:50] and it doesn't like it [08:48:55] haha poor db2101 [08:49:28] <_joe_> indeed [08:50:26] I cannot telnet db2101.codfw.wmnet 3306 from cumin, can you? [08:50:32] I can to other hosts [08:50:37] or maybe I am mistyping [08:50:53] it is 3320 [08:51:01] eh? [08:51:06] the port isn't 3306 [08:51:09] oh, it is single instance, but non-default port [08:51:10] it is 3320 [08:51:22] that threw me off [08:51:30] hehe [08:51:38] I think I will kill the snapshot [08:51:40] nc -v FTW BTW :) [08:51:43] yeah, we did that because it is a backup source, and to be consistent with the others [08:51:50] yes [08:51:52] nc -zv :) [08:52:01] also on eqiad it is different [08:52:07] because hw failure [08:52:41] (03PS6) 10Dzahn: Add profile for a static HTML dump of CodeReview [puppet] - 10https://gerrit.wikimedia.org/r/567407 (https://phabricator.wikimedia.org/T243056) (owner: 10Legoktm) [08:53:08] !log kill FTWRL on db2101 [08:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:17] that will kill the snapshotting [08:53:24] (03Abandoned) 10Dzahn: add profile to setup static-codereview.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/594669 (https://phabricator.wikimedia.org/T243056) (owner: 10Dzahn) [08:53:25] will retry later [08:53:31] and should fix the alert [08:55:17] (03PS7) 10Dzahn: Add profile for a static HTML dump of CodeReview [puppet] - 10https://gerrit.wikimedia.org/r/567407 (https://phabricator.wikimedia.org/T243056) (owner: 10Legoktm) [08:55:59] ACKNOWLEDGEMENT - MariaDB Slave Lag: x1 on db2101 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1183.34 seconds Jcrespo backups running created overload https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [08:56:18] !log restarting ps1-a4-eqiad.mgmt.eqiad.wmnet. 
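Note on the port check discussed above (telnet / `nc -zv` against db2101 on 3306 vs. 3320): a rough Python equivalent of `nc -zv host port` might look like the sketch below. The function name `port_is_open` and the 3-second timeout are illustrative assumptions, not something used in the channel; only the host name and the two port numbers come from the conversation.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`.

    Hypothetical helper: it only tests that the TCP handshake completes,
    much like `nc -zv`; it does not speak the MySQL/MariaDB protocol.
    """
    try:
        # create_connection resolves the name and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


# Example call (commented out; needs access to the internal network):
# for port in (3306, 3320):
#     print(port, port_is_open("db2101.codfw.wmnet", port))
```

Per the exchange above, the instance listens on 3320 rather than the default 3306, so a check against 3306 would be expected to fail even when the server is healthy.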
[08:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:59] RECOVERY - MariaDB Slave Lag: x1 on db2101 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [08:58:15] (03PS8) 10Dzahn: Add profile and module for for static HTML dump of CodeReview [puppet] - 10https://gerrit.wikimedia.org/r/567407 (https://phabricator.wikimedia.org/T243056) (owner: 10Legoktm) [08:58:56] jbond42: possible this is related to your work? https://phabricator.wikimedia.org/T251987 [08:59:11] looking [08:59:32] mutante: yes definitely, I'll acknowledge and fix [08:59:49] jbond42: i acked them after making the ticket. and thanks [09:00:01] great thanks [09:00:07] PROBLEM - ps1-a4-eqiad-infeed-load-tower-A-phase-Y on ps1-a4-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:00:15] ^ also me :) [09:00:23] PROBLEM - ps1-a4-eqiad-infeed-load-tower-B-phase-X on ps1-a4-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:00:25] alright [09:00:29] PROBLEM - ps1-a4-eqiad-infeed-load-tower-A-phase-Z on ps1-a4-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:00:31] RECOVERY - dump of m3 in eqiad on db1115 is OK: Last dump for m3 at eqiad (db1117.eqiad.wmnet:3323) taken on 2020-05-06 08:05:54 (55 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [09:00:35] PROBLEM - ps1-a4-eqiad-infeed-load-tower-A-phase-X on ps1-a4-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:01:18] (03CR) 10Dzahn: [C: 03+2] Add profile and module for for static 
HTML dump of CodeReview [puppet] - 10https://gerrit.wikimedia.org/r/567407 (https://phabricator.wikimedia.org/T243056) (owner: 10Legoktm) [09:01:27] (03PS9) 10Dzahn: Add profile and module for for static HTML dump of CodeReview [puppet] - 10https://gerrit.wikimedia.org/r/567407 (https://phabricator.wikimedia.org/T243056) (owner: 10Legoktm) [09:01:37] PROBLEM - ps1-a4-eqiad-infeed-load-tower-B-phase-Y on ps1-a4-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:02:04] ACKNOWLEDGEMENT - ps1-a4-eqiad-infeed-load-tower-A-phase-X on ps1-a4-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call John Bond Currently testing a snmp reset script - jb - T251987 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:02:04] ACKNOWLEDGEMENT - ps1-a4-eqiad-infeed-load-tower-A-phase-Y on ps1-a4-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call John Bond Currently testing a snmp reset script - jb - T251987 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:02:04] ACKNOWLEDGEMENT - ps1-a4-eqiad-infeed-load-tower-A-phase-Z on ps1-a4-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call John Bond Currently testing a snmp reset script - jb - T251987 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:02:04] ACKNOWLEDGEMENT - ps1-a4-eqiad-infeed-load-tower-B-phase-X on ps1-a4-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call John Bond Currently testing a snmp reset script - jb - T251987 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:02:04] ACKNOWLEDGEMENT - ps1-a4-eqiad-infeed-load-tower-B-phase-Y on ps1-a4-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call John Bond Currently testing a snmp reset script - jb - T251987 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:02:28] 10Operations, 10ops-codfw: ps1-a7-codfw - monitoring alerts - https://phabricator.wikimedia.org/T251987 (10jbond) This is caused by me testing a script to reset the snmp communities, ps1-a4-eqiad is also affected [09:02:45] 10Operations, 10ops-codfw: ps1-a7-codfw - monitoring alerts - https://phabricator.wikimedia.org/T251987 (10jbond) p:05Triage→03Medium a:03jbond [09:03:57] PROBLEM - ps1-a4-eqiad-infeed-load-tower-B-phase-Z on ps1-a4-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:05:54] ACKNOWLEDGEMENT - ps1-a4-eqiad-infeed-load-tower-B-phase-Z on ps1-a4-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call John Bond testing snmp reset script - jbond - T251987 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:07:23] (03PS2) 10Ema: prometheus: remove node_vhtcpd [puppet] - 10https://gerrit.wikimedia.org/r/593195 (https://phabricator.wikimedia.org/T251374) [09:08:41] (03CR) 10Hashar: [C: 03+1] "Sounds good so :)" [puppet] - 10https://gerrit.wikimedia.org/r/593344 (owner: 10Krinkle) [09:09:23] (03CR) 10Hashar: [C: 03+1] "I replied too fast and forgot: Thank you Timo for the extra details!" 
[puppet] - 10https://gerrit.wikimedia.org/r/593344 (owner: 10Krinkle) [09:09:37] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: filter out notifications disabled on /alerts, add /problems [puppet] - 10https://gerrit.wikimedia.org/r/594441 (owner: 10Filippo Giunchedi) [09:10:28] (03PS1) 10Dzahn: add static-codereview profile to miscweb, comment out monitoring [puppet] - 10https://gerrit.wikimedia.org/r/594674 (https://phabricator.wikimedia.org/T243056) [09:12:10] !log Upgrade package on s3 and s7 master (db1123 and db1086) in preparation for tomorrow's restart - T251158 [09:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:13] T251158: Upgrade and restart s3 and s7 primary DB master: Thu 7th May - https://phabricator.wikimedia.org/T251158 [09:12:23] (03PS2) 10Dzahn: add static-codereview profile to miscweb, comment out monitoring [puppet] - 10https://gerrit.wikimedia.org/r/594674 (https://phabricator.wikimedia.org/T243056) [09:13:52] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/22329/miscweb1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/594674 (https://phabricator.wikimedia.org/T243056) (owner: 10Dzahn) [09:15:24] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Add flag -frontend_delay [software/purged] - 10https://gerrit.wikimedia.org/r/594454 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [09:17:16] (03CR) 10Dzahn: "> I set profile::base::notifications: disabled in hiera fairly regularly when brining hosts into service to avoid false positives" [puppet] - 10https://gerrit.wikimedia.org/r/594441 (owner: 10Filippo Giunchedi) [09:17:41] (03PS1) 10Vgutierrez: ATS: Add atstls.mtail program [puppet] - 10https://gerrit.wikimedia.org/r/594677 (https://phabricator.wikimedia.org/T244538) [09:17:57] 10Operations, 10DBA: Upgrade and restart s3 and s7 primary DB master: Thu 7th May - https://phabricator.wikimedia.org/T251158 (10Marostegui) Package upgraded on db1123 and db1086. 
[09:18:10] (03CR) 10Ema: [C: 03+2] Add flag -frontend_delay [software/purged] - 10https://gerrit.wikimedia.org/r/594454 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [09:18:19] (03PS20) 10Jbond: cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) [09:18:38] (03CR) 10Ema: [C: 03+2] prometheus: remove node_vhtcpd [puppet] - 10https://gerrit.wikimedia.org/r/593195 (https://phabricator.wikimedia.org/T251374) (owner: 10Ema) [09:18:46] (03CR) 10jerkins-bot: [V: 04-1] ATS: Add atstls.mtail program [puppet] - 10https://gerrit.wikimedia.org/r/594677 (https://phabricator.wikimedia.org/T244538) (owner: 10Vgutierrez) [09:19:09] (03CR) 10Vgutierrez: "pcc seems happy: https://puppet-compiler.wmflabs.org/compiler1003/22325/" [puppet] - 10https://gerrit.wikimedia.org/r/594670 (https://phabricator.wikimedia.org/T244538) (owner: 10Vgutierrez) [09:20:54] (03PS2) 10Vgutierrez: ATS: Add atstls.mtail program [puppet] - 10https://gerrit.wikimedia.org/r/594677 (https://phabricator.wikimedia.org/T244538) [09:22:59] 10Operations, 10Analytics, 10Traffic, 10Patch-For-Review: Create replacement for Varnishkafka - https://phabricator.wikimedia.org/T237993 (10fgiunchedi) Chiming in with two cents and my Prometheus hat: I agree with @ema that none of the options are great unfortunately. Rephrasing to make sure I understand... [09:23:40] (03CR) 10Dzahn: "> I think removing 'notifications disabled' hosts from /alerts is orthogonal to the policy question of whether or not it's okay to disable" [puppet] - 10https://gerrit.wikimedia.org/r/594441 (owner: 10Filippo Giunchedi) [09:24:52] (03CR) 10Ema: [C: 03+1] "Commit message nit, lgtm otherwise." 
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/594670 (https://phabricator.wikimedia.org/T244538) (owner: Vgutierrez)
[09:25:45] (PS2) Ema: purged: use nrpe::monitor_systemd_unit_state [puppet] - https://gerrit.wikimedia.org/r/593196 (https://phabricator.wikimedia.org/T251374)
[09:30:02] (PS21) Jbond: cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890)
[09:30:15] RECOVERY - dump of x1 in codfw on db1115 is OK: Last dump for x1 at codfw (db2101.codfw.wmnet:3320) taken on 2020-05-06 08:08:42 (28 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[09:30:49] (CR) Ema: [C: +2] purged: use nrpe::monitor_systemd_unit_state [puppet] - https://gerrit.wikimedia.org/r/593196 (https://phabricator.wikimedia.org/T251374) (owner: Ema)
[09:32:05] (PS1) Dzahn: ATS: add backend for static-codereview on miscweb [puppet] - https://gerrit.wikimedia.org/r/594678 (https://phabricator.wikimedia.org/T243056)
[09:33:20] Operations: icinga Blocked by X-Frame-Options Policy - https://phabricator.wikimedia.org/T251513 (MoritzMuehlenhoff)
[09:34:11] (PS2) Ema: purged: make purge_host_regex default to undef [puppet] - https://gerrit.wikimedia.org/r/593197 (https://phabricator.wikimedia.org/T251374)
[09:35:49] RECOVERY - ps1-a7-codfw-infeed-load-tower-B-phase-Y on ps1-a7-codfw is OK: SNMP OK - ps1-a7-codfw-infeed-load-tower-B-phase-Y 84 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:35:55] (PS4) Arturo Borrero Gonzalez: kubeadm: refactor toolforge code for reuse by PAWS [puppet] - https://gerrit.wikimedia.org/r/594471 (https://phabricator.wikimedia.org/T251297)
[09:36:19] RECOVERY - ps1-a7-codfw-infeed-load-tower-A-phase-Z on ps1-a7-codfw is OK: SNMP OK - ps1-a7-codfw-infeed-load-tower-A-phase-Z 706 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:36:19]
RECOVERY - ps1-a7-codfw-infeed-load-tower-B-phase-X on ps1-a7-codfw is OK: SNMP OK - ps1-a7-codfw-infeed-load-tower-B-phase-X 325 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:36:19] RECOVERY - ps1-a7-codfw-infeed-load-tower-A-phase-X on ps1-a7-codfw is OK: SNMP OK - ps1-a7-codfw-infeed-load-tower-A-phase-X 653 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:36:35] RECOVERY - ps1-a7-codfw-infeed-load-tower-B-phase-Z on ps1-a7-codfw is OK: SNMP OK - ps1-a7-codfw-infeed-load-tower-B-phase-Z 417 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:36:51] RECOVERY - ps1-a7-codfw-infeed-load-tower-A-phase-Y on ps1-a7-codfw is OK: SNMP OK - ps1-a7-codfw-infeed-load-tower-A-phase-Y 85 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:37:54] (PS1) Kormat: install_server: Add no-srv-format-testing.cfg [puppet] - https://gerrit.wikimedia.org/r/594679 (https://phabricator.wikimedia.org/T251768)
[09:39:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1121 and remove db1103:3314 from vslow in s4', diff saved to https://phabricator.wikimedia.org/P11159 and previous config saved to /var/cache/conftool/dbconfig/20200506-093940-marostegui.json
[09:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:56] (PS5) Arturo Borrero Gonzalez: kubeadm: refactor toolforge code for reuse by PAWS [puppet] - https://gerrit.wikimedia.org/r/594471 (https://phabricator.wikimedia.org/T251297)
[09:41:58] (CR) Jcrespo: [C: +1] install_server: Add no-srv-format-testing.cfg [puppet] - https://gerrit.wikimedia.org/r/594679 (https://phabricator.wikimedia.org/T251768) (owner: Kormat)
[09:43:47] RECOVERY - ps1-a4-eqiad-infeed-load-tower-A-phase-Z on ps1-a4-eqiad is OK: SNMP OK - ps1-a4-eqiad-infeed-load-tower-A-phase-Z 457
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:43:47] RECOVERY - ps1-a4-eqiad-infeed-load-tower-A-phase-X on ps1-a4-eqiad is OK: SNMP OK - ps1-a4-eqiad-infeed-load-tower-A-phase-X 330 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:43:49] RECOVERY - ps1-a4-eqiad-infeed-load-tower-B-phase-Y on ps1-a4-eqiad is OK: SNMP OK - ps1-a4-eqiad-infeed-load-tower-B-phase-Y 230 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:43:55] RECOVERY - ps1-a4-eqiad-infeed-load-tower-B-phase-X on ps1-a4-eqiad is OK: SNMP OK - ps1-a4-eqiad-infeed-load-tower-B-phase-X 346 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:43:56] (CR) Kormat: [C: +2] install_server: Add no-srv-format-testing.cfg [puppet] - https://gerrit.wikimedia.org/r/594679 (https://phabricator.wikimedia.org/T251768) (owner: Kormat)
[09:44:19] RECOVERY - ps1-a4-eqiad-infeed-load-tower-B-phase-Z on ps1-a4-eqiad is OK: SNMP OK - ps1-a4-eqiad-infeed-load-tower-B-phase-Z 470 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:44:29] RECOVERY - ps1-a4-eqiad-infeed-load-tower-A-phase-Y on ps1-a4-eqiad is OK: SNMP OK - ps1-a4-eqiad-infeed-load-tower-A-phase-Y 238 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:45:07] (CR) Marostegui: install_server: Add no-srv-format-testing.cfg (1 comment) [puppet] - https://gerrit.wikimedia.org/r/594679 (https://phabricator.wikimedia.org/T251768) (owner: Kormat)
[09:46:47] Operations, Patch-For-Review, User-fgiunchedi: Standardizing our partman recipes - https://phabricator.wikimedia.org/T156955 (fgiunchedi) We're in a much better place nowadays: all custom recipes have been moved to `partman/custom` and what's left are mostly special cases: ` $ git grep -i partman/cu...
[09:47:06] (CR) Muehlenhoff: "I realise this is just for a test, but a quick remark: When we polish this up for production, we should fold into into the new standardise" [puppet] - https://gerrit.wikimedia.org/r/594679 (https://phabricator.wikimedia.org/T251768) (owner: Kormat)
[09:47:14] (CR) Kormat: install_server: Add no-srv-format-testing.cfg (1 comment) [puppet] - https://gerrit.wikimedia.org/r/594679 (https://phabricator.wikimedia.org/T251768) (owner: Kormat)
[09:48:31] (CR) Marostegui: install_server: Add no-srv-format-testing.cfg (1 comment) [puppet] - https://gerrit.wikimedia.org/r/594679 (https://phabricator.wikimedia.org/T251768) (owner: Kormat)
[09:49:11] (CR) Jbond: [C: +2] ferm-status: refactor and add support for multiple tables [puppet] - https://gerrit.wikimedia.org/r/593237 (owner: Jbond)
[09:52:09] (CR) Jbond: [C: +2] apereo_cas: add more timeout values (1 comment) [puppet] - https://gerrit.wikimedia.org/r/587515 (owner: Jbond)
[09:52:28] !log enable remember me feature of CAS
[09:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:32] q!
[09:52:47] (PS3) Ema: purged: make purge_host_regex default to undef [puppet] - https://gerrit.wikimedia.org/r/593197 (https://phabricator.wikimedia.org/T251374)
[09:53:02] (PS6) Arturo Borrero Gonzalez: kubeadm: refactor toolforge code for reuse by PAWS [puppet] - https://gerrit.wikimedia.org/r/594471 (https://phabricator.wikimedia.org/T251297)
[09:53:15] (CR) Kormat: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/594679 (https://phabricator.wikimedia.org/T251768) (owner: Kormat)
[09:53:18] (PS5) Addshore: Wikidata: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569258 (https://phabricator.wikimedia.org/T242087) (owner: WMDE-leszek)
[09:55:18] (CR) Addshore: [C: -1] Wikidata: Define entity sources configuration (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/569258 (https://phabricator.wikimedia.org/T242087) (owner: WMDE-leszek)
[09:55:36] (PS5) Addshore: Wikidata client wikis: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T242087) (owner: WMDE-leszek)
[09:55:40] (CR) Ema: "pcc here: https://puppet-compiler.wmflabs.org/compiler1003/22335/" [puppet] - https://gerrit.wikimedia.org/r/593197 (https://phabricator.wikimedia.org/T251374) (owner: Ema)
[09:56:24] Operations, LDAP-Access-Requests, WMF-Legal: Add Tobias Andersson to the ldap/wmde group - https://phabricator.wikimedia.org/T251997 (toan)
[09:56:27] (CR) jerkins-bot: [V: -1] Wikidata client wikis: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T242087) (owner: WMDE-leszek)
[09:56:41] (PS1) Jbond: apereo_cas: use 7 days as a max time value [puppet] - https://gerrit.wikimedia.org/r/594680
[09:56:54] (CR) Jbond: [C: +2] apereo_cas: use 7 days as a max time value [puppet] -
https://gerrit.wikimedia.org/r/594680 (owner: Jbond)
[09:56:56] (CR) Vgutierrez: [C: +1] purged: make purge_host_regex default to undef [puppet] - https://gerrit.wikimedia.org/r/593197 (https://phabricator.wikimedia.org/T251374) (owner: Ema)
[09:57:06] (CR) Ema: [C: +2] purged: make purge_host_regex default to undef [puppet] - https://gerrit.wikimedia.org/r/593197 (https://phabricator.wikimedia.org/T251374) (owner: Ema)
[09:57:12] (CR) Muehlenhoff: "> standard.cfg uses software raid," [puppet] - https://gerrit.wikimedia.org/r/594679 (https://phabricator.wikimedia.org/T251768) (owner: Kormat)
[09:57:25] (CR) Jbond: [V: +2 C: +2] apereo_cas: use 7 days as a max time value [puppet] - https://gerrit.wikimedia.org/r/594680 (owner: Jbond)
[09:58:25] (CR) Addshore: [C: -1] Wikidata client wikis: Define entity sources configuration (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T242087) (owner: WMDE-leszek)
[09:58:35] (PS5) Addshore: Commons: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569260 (https://phabricator.wikimedia.org/T242087) (owner: WMDE-leszek)
[09:59:28] (CR) jerkins-bot: [V: -1] Commons: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569260 (https://phabricator.wikimedia.org/T242087) (owner: WMDE-leszek)
[09:59:41] (CR) Addshore: [C: -1] Commons: Define entity sources configuration (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/569260 (https://phabricator.wikimedia.org/T242087) (owner: WMDE-leszek)
[09:59:53] (PS6) Addshore: Wikidata/Wikibase: use entity source Wikibase setting for all wikibase-enabled wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/569261 (https://phabricator.wikimedia.org/T242087) (owner: WMDE-leszek)
[10:00:15] !log installing remaining openldap security updates (client-side
libs, tools)
[10:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:01] (CR) Addshore: [C: -1] Wikidata/Wikibase: use entity source Wikibase setting for all wikibase-enabled wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/569261 (https://phabricator.wikimedia.org/T242087) (owner: WMDE-leszek)
[10:01:30] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:02:11] (PS1) Jbond: Apereo_cas: correct typo and increase default max session time [puppet] - https://gerrit.wikimedia.org/r/594681
[10:03:55] (CR) Jbond: [C: +2] Apereo_cas: correct typo and increase default max session time [puppet] - https://gerrit.wikimedia.org/r/594681 (owner: Jbond)
[10:07:06] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:07:22] (PS1) Kormat: install_server: no-srv-format-testing.cfg v2 [puppet] - https://gerrit.wikimedia.org/r/594683 (https://phabricator.wikimedia.org/T251768)
[10:09:09] (CR) Kormat: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/594679 (https://phabricator.wikimedia.org/T251768) (owner: Kormat)
[10:10:12] (PS3) Giuseppe Lavagetto: Add the ability to consume from kafka [software/purged] - https://gerrit.wikimedia.org/r/594147 (https://phabricator.wikimedia.org/T133821)
[10:10:14] (PS5) Giuseppe Lavagetto: Add integration tests using docker-compose [software/purged] - https://gerrit.wikimedia.org/r/594148 (https://phabricator.wikimedia.org/T133821)
[10:10:27] Operations, LDAP-Access-Requests, Patch-For-Review: Add `Jgiannelos` to `wmf` LDAP group - https://phabricator.wikimedia.org/T251899 (Jgiannelos)
[10:10:47] Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (Marostegui)
[10:11:27] Operations, LDAP-Access-Requests, Patch-For-Review: Add `Jgiannelos` to `wmf` LDAP group - https://phabricator.wikimedia.org/T251899 (Jgiannelos) Hi @colewhite I updated the purpose field in the task.
[10:13:14] (CR) Marostegui: [C: +1] "I remember having some issues with this but yeeeears ago (like a lot of years ago)."
[puppet] - https://gerrit.wikimedia.org/r/594683 (https://phabricator.wikimedia.org/T251768) (owner: Kormat)
[10:13:34] (Abandoned) Vgutierrez: ATS: Add the client TTFB to ats-tls log [puppet] - https://gerrit.wikimedia.org/r/594670 (https://phabricator.wikimedia.org/T244538) (owner: Vgutierrez)
[10:14:07] (CR) Kormat: [C: +2] install_server: no-srv-format-testing.cfg v2 [puppet] - https://gerrit.wikimedia.org/r/594683 (https://phabricator.wikimedia.org/T251768) (owner: Kormat)
[10:19:01] (PS7) Arturo Borrero Gonzalez: kubeadm: refactor toolforge code for reuse by PAWS [puppet] - https://gerrit.wikimedia.org/r/594471 (https://phabricator.wikimedia.org/T251297)
[10:19:06] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:19:48] Operations, LDAP-Access-Requests, Patch-For-Review: Add `Jgiannelos` to `wmf` LDAP group - https://phabricator.wikimedia.org/T251899 (Jgiannelos) CC'ing also my manager @dr0ptp4kt
[10:20:08] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:20:44] !log restarting apache on dbmonitor/grafana/miscweb/graphite/netmon to pick up openldap update
[10:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:12] (PS1) Ema: Release 0.10: new flag -frontend_delay [software/purged] - https://gerrit.wikimedia.org/r/594684 (https://phabricator.wikimedia.org/T249583)
[10:23:11] (PS2) Ema: Release 0.10: new flag -frontend_delay [software/purged] - https://gerrit.wikimedia.org/r/594684 (https://phabricator.wikimedia.org/T249583)
[10:24:06] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[10:24:08] Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (Marostegui)
[10:24:37] (CR) Ema: [C: +2] Release 0.10: new flag -frontend_delay [software/purged] - https://gerrit.wikimedia.org/r/594684 (https://phabricator.wikimedia.org/T249583) (owner: Ema)
[10:26:00] PROBLEM - Unmerged changes on repository puppet on labtestpuppetmaster2001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[10:26:51] kormat: ^ you still have a pending change on the master
[10:27:03] oops, thanks
[10:27:05] (PS8) Arturo Borrero Gonzalez: kubeadm: refactor toolforge code for reuse by PAWS [puppet] - https://gerrit.wikimedia.org/r/594471 (https://phabricator.wikimedia.org/T251297)
[10:27:38] !log cp2027: test purged v0.10
[10:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:22] RECOVERY - Unmerged changes on repository puppet on labtestpuppetmaster2001 is OK: No changes to merge.
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[10:30:11] (CR) Volans: [C: -1] "The paths have a typo." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/594602 (owner: CRusnov)
[10:30:41] (CR) Arturo Borrero Gonzalez: [C: +1] "> Patch Set 1:" [software/tools-webservice] - https://gerrit.wikimedia.org/r/594579 (owner: Bstorm)
[10:34:08] (CR) Volans: [C: +1] "Sorry for the delay, post-merge +1 fwiw" [puppet] - https://gerrit.wikimedia.org/r/593237 (owner: Jbond)
[10:34:50] thanks volans
[10:38:43] (PS1) Muehlenhoff: Enable base::service_auto_restart for apache2 on xhgui [puppet] - https://gerrit.wikimedia.org/r/594685 (https://phabricator.wikimedia.org/T135991)
[10:48:56] (PS2) Muehlenhoff: Enable base::service_auto_restart for apache2 on xhgui [puppet] - https://gerrit.wikimedia.org/r/594685 (https://phabricator.wikimedia.org/T135991)
[10:49:41] (PS3) Vgutierrez: ATS: Add atstls.mtail program [puppet] - https://gerrit.wikimedia.org/r/594677 (https://phabricator.wikimedia.org/T244538)
[10:49:43] (PS1) Vgutierrez: ATS: Use the analytics log on atslog-tls [puppet] - https://gerrit.wikimedia.org/r/594686
[10:49:45] (PS1) Vgutierrez: ATS: Remove tls log and the associated log format [puppet] - https://gerrit.wikimedia.org/r/594687
[10:52:04] (CR) Vgutierrez: "pcc seems happy: https://puppet-compiler.wmflabs.org/compiler1002/22338/" [puppet] - https://gerrit.wikimedia.org/r/594686 (owner: Vgutierrez)
[10:55:23] (PS1) Jbond: apereo_cas: update to have no embeded webserver [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/594688 (https://phabricator.wikimedia.org/T233950)
[10:56:16] (PS1) Jbond: apereo_cas: add support for external tomcat instance [puppet] - https://gerrit.wikimedia.org/r/594689 (https://phabricator.wikimedia.org/T233950)
[10:56:52] (CR) Jbond: [C: -1] "self -1 as the production nodes are not configured for this"
[software/cas-overlay-template] - https://gerrit.wikimedia.org/r/594688 (https://phabricator.wikimedia.org/T233950) (owner: Jbond)
[10:57:32] (PS2) KartikMistry: Enable ContentTranslation in Armenian Wikipedia as a default tool [mediawiki-config] - https://gerrit.wikimedia.org/r/594668 (https://phabricator.wikimedia.org/T249229)
[10:59:35] (CR) Ema: [C: +1] ATS: Use the analytics log on atslog-tls [puppet] - https://gerrit.wikimedia.org/r/594686 (owner: Vgutierrez)
[11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200506T1100).
[11:00:04] chiborg and kart_: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:14] * kart_ is here..
[11:02:20] It seems change https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikimediaEvents/+/593502/ listed in SWAT need to backport to wmf branch, but change seems in master branch right now. I also see no +1 on it.
[11:02:25] chiborg ^
[11:03:12] sorry, it's been a long time since i did a swat deploy, what do I have to do?
[11:05:02] chiborg: You need to backport it to wmf branch (see: https://versions.toolforge.org/ for which version you want) and get +1 from Developer. I'm not sure about this can be deploy right now.
[11:05:55] chiborg: Do you know who usually deploys in this repository?
[11:06:04] I'll go ahead with my change meanwhile.
[11:06:04] (PS9) Arturo Borrero Gonzalez: kubeadm: refactor toolforge code for reuse by PAWS [puppet] - https://gerrit.wikimedia.org/r/594471 (https://phabricator.wikimedia.org/T251297)
[11:06:21] (CR) KartikMistry: [C: +2] "SWAT."
[mediawiki-config] - https://gerrit.wikimedia.org/r/594668 (https://phabricator.wikimedia.org/T249229) (owner: KartikMistry)
[11:06:23] chiborg: Backports that aren't merged in master can't be SWAT-deployed
[11:06:52] (PS1) Muehlenhoff: Enable base::service_auto_restart for apache2 on xhgui [puppet] - https://gerrit.wikimedia.org/r/594691 (https://phabricator.wikimedia.org/T135991)
[11:07:08] Urbanecm: that too :)
[11:07:14] (Merged) jenkins-bot: Enable ContentTranslation in Armenian Wikipedia as a default tool [mediawiki-config] - https://gerrit.wikimedia.org/r/594668 (https://phabricator.wikimedia.org/T249229) (owner: KartikMistry)
[11:07:21] Urbanecm: with some exception though :)
[11:09:40] (PS2) Muehlenhoff: Enable base::service_auto_restart for apache2 on graphite [puppet] - https://gerrit.wikimedia.org/r/594691 (https://phabricator.wikimedia.org/T135991)
[11:10:22] (CR) Ayounsi: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/594441 (owner: Filippo Giunchedi)
[11:13:19] !log kartik@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit|594668|Enable ContentTranslation in Armenian WP as a default tool (T249229)]] (duration: 01m 08s)
[11:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:23] T249229: Enable Content Translation in Armenian Wikipedia as a default tool - https://phabricator.wikimedia.org/T249229
[11:16:46] Operations, CAS-SSO: icinga Blocked by X-Frame-Options Policy - https://phabricator.wikimedia.org/T251513 (MoritzMuehlenhoff)
[11:17:40] Operations, Security-Team, CAS-SSO, User-jbond: Further steps for CAS/web SSO - https://phabricator.wikimedia.org/T233921 (MoritzMuehlenhoff)
[11:18:18] Operations, CAS-SSO, User-jbond: Create a staging environment for CAS - https://phabricator.wikimedia.org/T233930 (MoritzMuehlenhoff)
[11:18:22] Operations, CAS-SSO, User-jbond: Create a staging environment for CAS -
https://phabricator.wikimedia.org/T233930 (MoritzMuehlenhoff) Open→Resolved We now have a staging system, closing this.
[11:18:25] Operations, Security-Team, CAS-SSO, User-jbond: Further steps for CAS/web SSO - https://phabricator.wikimedia.org/T233921 (MoritzMuehlenhoff)
[11:18:38] Operations, CAS-SSO, User-jbond: Cross data center setup for CAS - https://phabricator.wikimedia.org/T233931 (MoritzMuehlenhoff)
[11:18:48] Operations, CAS-SSO, Patch-For-Review, User-jbond: Replicated ticket registry - https://phabricator.wikimedia.org/T233933 (MoritzMuehlenhoff)
[11:18:55] Operations, CAS-SSO, User-jbond: Icinga Monitoring for CAS - https://phabricator.wikimedia.org/T233935 (MoritzMuehlenhoff)
[11:19:09] Operations, CAS-SSO, Patch-For-Review, User-jbond: Add U2F/FIDO as second factor for CAS - https://phabricator.wikimedia.org/T233937 (MoritzMuehlenhoff)
[11:19:16] Operations, Security-Team, CAS-SSO, User-jbond: SSO kill switch for crucial services - https://phabricator.wikimedia.org/T233938 (MoritzMuehlenhoff)
[11:19:27] Operations, CAS-SSO, User-jbond: Wikimedia theme for SSO login page - https://phabricator.wikimedia.org/T233939 (MoritzMuehlenhoff)
[11:19:58] Operations, CAS-SSO, User-jbond: CLI tools for CAS administration - https://phabricator.wikimedia.org/T233940 (MoritzMuehlenhoff)
[11:20:09] Operations, Security-Team, CAS-SSO, User-jbond: Validate Single Logout Flow - https://phabricator.wikimedia.org/T233941 (MoritzMuehlenhoff)
[11:20:19] Operations, Security-Team, CAS-SSO, User-jbond: Maintain session history / audit log - https://phabricator.wikimedia.org/T233942 (MoritzMuehlenhoff)
[11:20:33] Operations, Analytics, Security-Team, CAS-SSO, and 2 others: Log / alert on too many failing logins / Throttling login attempts - https://phabricator.wikimedia.org/T233944 (MoritzMuehlenhoff)
[11:20:55] Operations, CAS-SSO, User-jbond: Validate user
lockout - https://phabricator.wikimedia.org/T233946 (MoritzMuehlenhoff)
[11:21:06] Operations, CAS-SSO, User-jbond: CAS build as a deb - https://phabricator.wikimedia.org/T233947 (MoritzMuehlenhoff)
[11:21:19] Operations, CAS-SSO, User-jbond: CAS build as a deb - https://phabricator.wikimedia.org/T233947 (MoritzMuehlenhoff) a:MoritzMuehlenhoff I'm already working on this
[11:21:31] Operations, CAS-SSO, User-jbond: Review ticket policies - https://phabricator.wikimedia.org/T233948 (MoritzMuehlenhoff)
[11:21:42] Operations, CAS-SSO, User-jbond: Fine-tune CAS logging - https://phabricator.wikimedia.org/T233949 (MoritzMuehlenhoff)
[11:21:55] Operations, CAS-SSO, Patch-For-Review, User-jbond: Revisit Tomcat deployment of CAS - https://phabricator.wikimedia.org/T233950 (MoritzMuehlenhoff)
[11:22:01] Operations, CAS-SSO, User-jbond: Investigate how automated tasks can authenticate against CAS - https://phabricator.wikimedia.org/T239323 (MoritzMuehlenhoff)
[11:22:08] !log EU SWAT done.
[11:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:12] Operations, Security-Team, CAS-SSO, User-jbond: Icinga check for CAS-protected web services - https://phabricator.wikimedia.org/T245743 (MoritzMuehlenhoff)
[11:22:23] Operations, CAS-SSO, Performance Issue: Investigate CAS performance - https://phabricator.wikimedia.org/T246010 (MoritzMuehlenhoff)
[11:23:36] chiborg: Please get patch merged and submit backport to the branch. You can easily do it by Cherry Pick button on Gerrit. I've marked change as 'not done' in the Deployment calendar.
[11:26:42] (CR) Elukey: [C: +2] profile::kerberos::client: allow to set a different credentials cache dir [puppet] - https://gerrit.wikimedia.org/r/594516 (owner: Elukey)
[11:26:54] (CR) Elukey: [C: +2] Change Kerberos credentials cache location on stat1005 [puppet] - https://gerrit.wikimedia.org/r/594519 (owner: Elukey)
[11:34:12] kart_ Thank you. I've merged the patch to master and created the PRs for the branches, I'll reschedule or wait for the train.
[11:35:25] (PS1) Hnowlan: changeprop: produce purge events to topic [deployment-charts] - https://gerrit.wikimedia.org/r/594696
[11:35:59] (PS1) Jbond: pcc: refactor code [puppet] - https://gerrit.wikimedia.org/r/594697
[11:37:26] (PS10) Arturo Borrero Gonzalez: kubeadm: refactor toolforge code for reuse by PAWS [puppet] - https://gerrit.wikimedia.org/r/594471 (https://phabricator.wikimedia.org/T251297)
[11:49:40] (PS11) Arturo Borrero Gonzalez: kubeadm: refactor toolforge code for reuse by PAWS [puppet] - https://gerrit.wikimedia.org/r/594471 (https://phabricator.wikimedia.org/T251297)
[11:55:58] (PS12) Arturo Borrero Gonzalez: kubeadm: refactor toolforge code for reuse by PAWS [puppet] - https://gerrit.wikimedia.org/r/594471 (https://phabricator.wikimedia.org/T251297)
[11:58:43] (CR) Gehel: [C: +1] "> Patch Set 6: Code-Review+1" [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/593833 (https://phabricator.wikimedia.org/T222669) (owner: Mstyles)
[12:00:15] (PS13) Arturo Borrero Gonzalez: kubeadm: refactor toolforge code for reuse by PAWS [puppet] - https://gerrit.wikimedia.org/r/594471 (https://phabricator.wikimedia.org/T251297)
[12:00:53] (CR) Filippo Giunchedi: [C: +1] Enable base::service_auto_restart for apache2 on graphite [puppet] - https://gerrit.wikimedia.org/r/594691 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[12:12:51] Operations, LDAP-Access-Requests, Patch-For-Review: Add
`Jgiannelos` to `wmf` LDAP group - https://phabricator.wikimedia.org/T251899 (dr0ptp4kt) Approved
[12:28:09] (CR) Arturo Borrero Gonzalez: [C: +2] kubeadm: refactor toolforge code for reuse by PAWS [puppet] - https://gerrit.wikimedia.org/r/594471 (https://phabricator.wikimedia.org/T251297) (owner: Arturo Borrero Gonzalez)
[12:30:45] (PS1) Marostegui: install_server: Reimage db2078 to Buster [puppet] - https://gerrit.wikimedia.org/r/594702 (https://phabricator.wikimedia.org/T250666)
[12:31:31] (CR) Marostegui: [C: +2] install_server: Reimage db2078 to Buster [puppet] - https://gerrit.wikimedia.org/r/594702 (https://phabricator.wikimedia.org/T250666) (owner: Marostegui)
[12:32:02] (CR) Volans: "minor nits, it looks ok to me otherwise. Not tested." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/594697 (owner: Jbond)
[12:35:07] (CR) Vgutierrez: [C: +2] ATS: Use the analytics log on atslog-tls [puppet] - https://gerrit.wikimedia.org/r/594686 (owner: Vgutierrez)
[12:36:15] (PS1) Elukey: oozie: add a workaround to avoid shlib dirs on hdfs to be dropped [puppet] - https://gerrit.wikimedia.org/r/594703
[12:39:16] (CR) Mforns: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/594703 (owner: Elukey)
[12:39:52] Operations, Wikidata, Wikidata-Query-Service: Some queries causes wdqs-blazegraph on wdqs1006 to crash and restart - https://phabricator.wikimedia.org/T213191 (Gehel) Stalled→Declined This is probably a duplicate of T242453. There isn't enough context here to investigate more.
[12:43:33] (PS1) Elukey: jupyterhub: allow notebooks to write under /run/user [puppet] - https://gerrit.wikimedia.org/r/594704
[12:47:40] Operations, ops-eqiad: (Need by: TDB) rack/setup/install cloudelastic100[56] - https://phabricator.wikimedia.org/T249062 (elukey) Open→Resolved
[12:48:26] (CR) Vgutierrez: "pcc is happy: https://puppet-compiler.wmflabs.org/compiler1001/22340/" [puppet] - https://gerrit.wikimedia.org/r/594687 (owner: Vgutierrez)
[12:48:50] (CR) Elukey: [C: +2] oozie: add a workaround to avoid shlib dirs on hdfs to be dropped [puppet] - https://gerrit.wikimedia.org/r/594703 (owner: Elukey)
[12:54:30] Operations, ops-eqiad, DBA, DC-Ops: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (Jclark-ctr) led`s flashing in sequence of 2 most likely a processor error Case Reference ID: 5346998524 Status: Case is generated and in Progress Product: HPE ProLiant DL360 Gen10 8SF...
[12:59:57] Operations, LDAP-Access-Requests, WMF-Legal: Add Tobias Andersson to the ldap/wmde group - https://phabricator.wikimedia.org/T251997 (WMDE-leszek) As an engineering manager at WMDE, I endorse this request.
[13:00:05] brennen and hashar: Time to snap out of that daydream and deploy Mediawiki train - American+European Version (secondary timeslot). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200506T1300).
[13:08:23] !log start swift decom ms-be101[678] - T252008
[13:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:26] T252008: Decom ms-be101[678] - https://phabricator.wikimedia.org/T252008
[13:13:55] (PS2) Vgutierrez: ATS: Remove tls log and the associated log format [puppet] - https://gerrit.wikimedia.org/r/594687
[13:13:57] (PS4) Vgutierrez: ATS: Add atstls.mtail program [puppet] - https://gerrit.wikimedia.org/r/594677 (https://phabricator.wikimedia.org/T244538)
[13:13:59] (PS1) Vgutierrez: ATS: Add ensure support for Log definitions [puppet] - https://gerrit.wikimedia.org/r/594706
[13:14:26] (CR) Elukey: [C: -1] "Nope wrong unit :)" [puppet] - https://gerrit.wikimedia.org/r/594704 (owner: Elukey)
[13:14:49] (Abandoned) Elukey: jupyterhub: allow notebooks to write under /run/user [puppet] - https://gerrit.wikimedia.org/r/594704 (owner: Elukey)
[13:15:25] (CR) jerkins-bot: [V: -1] ATS: Add ensure support for Log definitions [puppet] - https://gerrit.wikimedia.org/r/594706 (owner: Vgutierrez)
[13:16:43] (CR) Muehlenhoff: [C: +2] Enable base::service_auto_restart for apache2 on graphite [puppet] - https://gerrit.wikimedia.org/r/594691 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[13:18:28] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is CRITICAL: 102.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37
[13:19:06] gehel,dcausse --^ interesting
[13:19:08] same behavior
[13:19:56] !log cp: upgrade purged to v0.10
[13:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:18] (PS2) Vgutierrez: ATS: Add ensure support for Log definitions [puppet] -
10https://gerrit.wikimedia.org/r/594706 [13:20:20] (03PS3) 10Vgutierrez: ATS: Remove tls log and the associated log format [puppet] - 10https://gerrit.wikimedia.org/r/594687 [13:20:22] (03PS5) 10Vgutierrez: ATS: Add atstls.mtail program [puppet] - 10https://gerrit.wikimedia.org/r/594677 (https://phabricator.wikimedia.org/T244538) [13:20:43] elukey: yes :/ but it's worrisome if no reindex are running [13:21:42] 10Operations, 10netops: Upgrade Routinator 3000 to 0.7.0 - https://phabricator.wikimedia.org/T252010 (10ayounsi) p:05Triage→03Low [13:23:46] (03PS1) 10Dzahn: httpbb: add test for static-codereview.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/594707 (https://phabricator.wikimedia.org/T243056) [13:24:16] (03CR) 10Ppchelko: [C: 04-1] "the tgz file is missing as well" (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/594696 (owner: 10Hnowlan) [13:24:38] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:25:17] this is me --^ [13:25:19] (03CR) 10Dzahn: [C: 03+1] Enable base::service_auto_restart for apache2 on xhgui [puppet] - 10https://gerrit.wikimedia.org/r/594685 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:25:26] !log add routinator 3000 0.7.0 to buster-wikimedia - T252010 [13:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:30] T252010: Upgrade Routinator 3000 to 0.7.0 - https://phabricator.wikimedia.org/T252010 [13:25:58] !log Upgrade Routinator 3000 to 0.7.0 on rpki2001 - T252010 [13:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:11] elukey: looking at logstash1010 I suspect it'll complain soonish as well [13:26:28] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:26:57] dcausse: :( [13:27:01] !log installing graphicsmagick security updates [13:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:26] !log puppetmaster - revoking cert for webserver-misc-static, not used anymore, merged into webserver-misc-apps [13:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:44] (03Restored) 10Elukey: jupyterhub: allow notebooks to write under /run/user [puppet] - 10https://gerrit.wikimedia.org/r/594704 (owner: 10Elukey) [13:28:25] (03PS2) 10Elukey: jupyterhub: allow notebooks to write under /run/user [puppet] - 10https://gerrit.wikimedia.org/r/594704 [13:29:16] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/594704 (owner: 10Elukey) [13:31:46] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:32:03] (03PS3) 10Elukey: jupyterhub: allow notebooks to write under /run/user [puppet] - 10https://gerrit.wikimedia.org/r/594704 [13:32:11] only updated the commit msg :) [13:32:17] !log Restarting CI Jenkins [13:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:36] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:33:41] (03PS2) 10Jbond: pcc: refactor code [puppet] - 10https://gerrit.wikimedia.org/r/594697 [13:33:43] (03PS1) 10Jbond: pcc: add ability to parse commit messages for `Hosts:` lines [puppet] - 10https://gerrit.wikimedia.org/r/594708 [13:33:52] (03CR) 10Elukey: [C: 03+2] jupyterhub: allow notebooks to write under /run/user [puppet] - 10https://gerrit.wikimedia.org/r/594704 (owner: 10Elukey) [13:35:18] (03CR) 10Jbond: pcc: add ability to parse commit messages for `Hosts:` lines (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/594708 (owner: 10Jbond) [13:35:34] (03CR) 10Elukey: [C: 03+2] "Luca this is a big PEBCAK" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/594704 (owner: 10Elukey) [13:35:36] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. 
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [13:36:09] !log puppetmaster - revoking cert for webserver-misc-apps, recreating it with static-codereview.wikimedia.org as additional SAN (T243056) [13:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:12] T243056: Set up static-codereview.wikimedia.org to host static HTML dump of CodeReview - https://phabricator.wikimedia.org/T243056 [13:36:39] (03PS1) 10Elukey: jupyter: fix rw directory for user units [puppet] - 10https://gerrit.wikimedia.org/r/594711 [13:37:07] ETOOMUCHKERBEROSINMYBRAIN --^ [13:37:25] (03CR) 10Elukey: [C: 03+2] jupyter: fix rw directory for user units [puppet] - 10https://gerrit.wikimedia.org/r/594711 (owner: 10Elukey) [13:39:43] 10Operations, 10netops: Upgrade Routinator 3000 to 0.7.0 - https://phabricator.wikimedia.org/T252010 (10ayounsi) So far so good, will let it sit until tomorrow before tackling rpki1001. [13:40:19] (03CR) 10Vgutierrez: "pcc looks good: https://puppet-compiler.wmflabs.org/compiler1002/22341/" [puppet] - 10https://gerrit.wikimedia.org/r/594706 (owner: 10Vgutierrez) [13:41:16] (03PS3) 10Jbond: pcc: refactor code [puppet] - 10https://gerrit.wikimedia.org/r/594697 [13:41:18] (03CR) 10Jbond: "updated" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/594697 (owner: 10Jbond) [13:43:52] (03PS1) 10Dzahn: ssl: update webserver-misc-apps, delete webserver-misc static cert [puppet] - 10https://gerrit.wikimedia.org/r/594712 (https://phabricator.wikimedia.org/T243056) [13:44:26] (03PS2) 10Jbond: pcc: add ability to parse commit messages for `Hosts:` lines [puppet] - 10https://gerrit.wikimedia.org/r/594708 [13:44:51] (03CR) 10jerkins-bot: [V: 04-1] pcc: add ability to parse commit messages for `Hosts:` lines [puppet] - 10https://gerrit.wikimedia.org/r/594708 (owner: 10Jbond) [13:46:18] (03CR) 10Dzahn: [C: 03+2] "openssl x509 -in webserver-misc-apps.discovery.wmnet.crt -text -noout | grep
DNS" [puppet] - 10https://gerrit.wikimedia.org/r/594712 (https://phabricator.wikimedia.org/T243056) (owner: 10Dzahn) [13:46:30] (03PS3) 10Jbond: pcc: add ability to parse commit messages for `Hosts:` lines [puppet] - 10https://gerrit.wikimedia.org/r/594708 [13:46:42] (03PS3) 10Vgutierrez: ATS: Add ensure support for Log definitions [puppet] - 10https://gerrit.wikimedia.org/r/594706 [13:46:44] (03PS4) 10Vgutierrez: ATS: Remove tls log and the associated log format [puppet] - 10https://gerrit.wikimedia.org/r/594687 [13:46:46] (03PS6) 10Vgutierrez: ATS: Add atstls.mtail program [puppet] - 10https://gerrit.wikimedia.org/r/594677 (https://phabricator.wikimedia.org/T244538) [13:47:52] (03PS2) 10Hnowlan: changeprop: produce purge events to topic [deployment-charts] - 10https://gerrit.wikimedia.org/r/594696 [13:48:03] (03CR) 10Hnowlan: changeprop: produce purge events to topic (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/594696 (owner: 10Hnowlan) [13:48:22] (03CR) 10Filippo Giunchedi: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/588759 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [13:49:40] (03CR) 10Filippo Giunchedi: "See inline re: cache, the rest LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/588759 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [13:50:53] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/588769 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [13:51:15] (03PS2) 10Dzahn: httpbb: add test for static-codereview.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/594707 (https://phabricator.wikimedia.org/T243056) [13:51:20] (03CR) 10Ottomata: "Ah effectively this will only drop anything 5 years, and only if oozie has been running that whole time? 
If oozie is restarted the 1000 d" [puppet] - 10https://gerrit.wikimedia.org/r/594703 (owner: 10Elukey) [13:52:02] (03CR) 10Dzahn: [C: 03+2] "[cumin1001:~] $ httpbb /tmp/sc --hosts=miscweb1002.eqiad.wmnet,miscweb2002.codfw.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/594707 (https://phabricator.wikimedia.org/T243056) (owner: 10Dzahn) [13:53:14] (03CR) 10Dzahn: [C: 03+2] ATS: add backend for static-codereview on miscweb [puppet] - 10https://gerrit.wikimedia.org/r/594678 (https://phabricator.wikimedia.org/T243056) (owner: 10Dzahn) [13:57:02] (03CR) 10CDanis: pcc: add ability to parse commit messages for `Hosts:` lines (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/594708 (owner: 10Jbond) [13:58:18] (03CR) 10Dzahn: [C: 03+2] contint: Add redirect from stale doc dirs to current ones [puppet] - 10https://gerrit.wikimedia.org/r/593344 (owner: 10Krinkle) [13:59:01] (03CR) 10Elukey: [C: 03+2] "> Ah effectively this will only drop anything 5 years, and only if" [puppet] - 10https://gerrit.wikimedia.org/r/594703 (owner: 10Elukey) [13:59:59] (03CR) 10Vgutierrez: "PCC for PS3: https://puppet-compiler.wmflabs.org/compiler1003/22343/" [puppet] - 10https://gerrit.wikimedia.org/r/594706 (owner: 10Vgutierrez) [14:00:13] (03PS1) 10Muehlenhoff: Add .gitreview file [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/594717 [14:01:34] (03CR) 10Dzahn: "works: curl https://doc.wikimedia.org//mediawiki-extensions-Wikibase/" [puppet] - 10https://gerrit.wikimedia.org/r/593344 (owner: 10Krinkle) [14:02:35] (03CR) 10Volans: pcc: refactor code (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/594697 (owner: 10Jbond) [14:03:10] (03CR) 10Jbond: "thanks updated" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/594708 (owner: 10Jbond) [14:04:52] (03CR) 10Ema: [C: 03+1] ATS: Add ensure support for Log definitions [puppet] - 10https://gerrit.wikimedia.org/r/594706 (owner: 10Vgutierrez) [14:05:21] (03CR) 10Ppchelko: [C: 04-1] changeprop: produce purge 
events to topic (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/594696 (owner: 10Hnowlan) [14:08:14] (03CR) 10Vgutierrez: [C: 03+2] ATS: Add ensure support for Log definitions [puppet] - 10https://gerrit.wikimedia.org/r/594706 (owner: 10Vgutierrez) [14:08:16] (03PS4) 10Vgutierrez: ATS: Add ensure support for Log definitions [puppet] - 10https://gerrit.wikimedia.org/r/594706 [14:10:09] (03CR) 10RLazarus: [C: 03+1] httpbb: add test for static-codereview.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/594707 (https://phabricator.wikimedia.org/T243056) (owner: 10Dzahn) [14:12:25] (03PS4) 10Jbond: pcc: refactor code [puppet] - 10https://gerrit.wikimedia.org/r/594697 [14:12:36] (03CR) 10Jbond: pcc: refactor code (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/594697 (owner: 10Jbond) [14:13:26] (03PS1) 10Muehlenhoff: Add debian/ directory to the build overlay (WIP) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/594718 (https://phabricator.wikimedia.org/T233947) [14:13:36] (03PS4) 10Jbond: pcc: add ability to parse commit messages for `Hosts:` lines [puppet] - 10https://gerrit.wikimedia.org/r/594708 [14:14:45] (03CR) 10Ppchelko: [C: 04-1] changeprop: produce purge events to topic (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/594696 (owner: 10Hnowlan) [14:16:13] (03CR) 10Ottomata: Initial debian commit (032 comments) [debs/anaconda] (debian) - 10https://gerrit.wikimedia.org/r/594204 (https://phabricator.wikimedia.org/T251006) (owner: 10Ottomata) [14:19:17] (03PS1) 10Dzahn: icinga: increase tresholds for check_ssl_http_letsencrypt [puppet] - 10https://gerrit.wikimedia.org/r/594722 (https://phabricator.wikimedia.org/T251726) [14:21:17] (03PS3) 10Dzahn: contint: move common and default Hiera settings to role level [puppet] - 10https://gerrit.wikimedia.org/r/594475 (https://phabricator.wikimedia.org/T224591) [14:21:57] (03CR) 10Muehlenhoff: [C: 04-1] Initial debian commit (031 comment) 
[debs/anaconda] (debian) - 10https://gerrit.wikimedia.org/r/594204 (https://phabricator.wikimedia.org/T251006) (owner: 10Ottomata) [14:21:59] (03PS1) 10Elukey: Revert "Change Kerberos credentials cache location on stat1005" [puppet] - 10https://gerrit.wikimedia.org/r/594723 [14:22:52] (03CR) 10Dzahn: [C: 04-1] "was expected to be noop but isn't: https://puppet-compiler.wmflabs.org/compiler1001/22344/contint1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/594475 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [14:23:03] (03CR) 10Ppchelko: [C: 04-1] changeprop: produce purge events to topic (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/594696 (owner: 10Hnowlan) [14:24:29] (03PS1) 10Jbond: pcc: add default paramaters [puppet] - 10https://gerrit.wikimedia.org/r/594724 [14:25:15] (03CR) 10Elukey: [C: 03+2] "https://github.com/openjdk/jdk/blob/master/src/java.security.jgss/share/classes/sun/security/krb5/internal/ccache/FileCredentialsCache.jav" [puppet] - 10https://gerrit.wikimedia.org/r/594723 (owner: 10Elukey) [14:25:47] (03PS4) 10Dzahn: contint: move common and default Hiera settings to role level [puppet] - 10https://gerrit.wikimedia.org/r/594475 (https://phabricator.wikimedia.org/T224591) [14:27:40] (03PS6) 10Hashar: Merge tag 'debian/1.8.17-1' into debian/buster-wikimedia [debs/doxygen] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/589416 (https://phabricator.wikimedia.org/T242155) [14:29:00] (03PS5) 10Vgutierrez: ATS: Remove tls log and the associated log format [puppet] - 10https://gerrit.wikimedia.org/r/594687 [14:31:18] (03CR) 10Hashar: [V: 03+1 C: 03+1] "I did a git merge --squash of debian/1.8.17-1 , thus the commit was not reflecting a merge of that tag but was still based on the previous" [debs/doxygen] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/589416 (https://phabricator.wikimedia.org/T242155) (owner: 10Hashar) [14:31:30] PROBLEM - mediawiki originals uploads 
-hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [14:31:40] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [14:36:32] (03PS6) 10Vgutierrez: ATS: Remove tls log and the associated log format [puppet] - 10https://gerrit.wikimedia.org/r/594687 [14:36:33] (03PS7) 10Vgutierrez: ATS: Add atstls.mtail program [puppet] - 10https://gerrit.wikimedia.org/r/594677 (https://phabricator.wikimedia.org/T244538) [14:36:35] (03PS1) 10Vgutierrez: ATS: Skip absent log definitions on logging.yaml [puppet] - 10https://gerrit.wikimedia.org/r/594725 [14:42:29] (03PS1) 10Elukey: profile::kerberos::client: set KRB5CCNAME with default_ccache_name [puppet] - 10https://gerrit.wikimedia.org/r/594726 [14:45:24] (03PS2) 10Elukey: profile::kerberos::client: set KRB5CCNAME with default_ccache_name [puppet] - 10https://gerrit.wikimedia.org/r/594726 [14:51:03] (03PS2) 10Vgutierrez: ATS: Skip absent log definitions on logging.yaml [puppet] - 10https://gerrit.wikimedia.org/r/594725 [14:51:05] (03PS7) 10Vgutierrez: ATS: Remove tls log and the associated log format [puppet] - 10https://gerrit.wikimedia.org/r/594687 [14:51:07] (03PS8) 10Vgutierrez: ATS: Add atstls.mtail program [puppet] - 10https://gerrit.wikimedia.org/r/594677 (https://phabricator.wikimedia.org/T244538) [14:51:09] (03CR) 10Bearloga: "> 1) What use cases do you have in mind? Is it only for oozie jobs or is there something else?" 
[puppet] - 10https://gerrit.wikimedia.org/r/589320 (https://phabricator.wikimedia.org/T230743) (owner: 10Bearloga) [14:53:06] (03PS3) 10Hnowlan: changeprop: produce purge events to topic [deployment-charts] - 10https://gerrit.wikimedia.org/r/594696 [14:55:12] (03CR) 10Dzahn: [C: 03+2] "now a real noop: https://puppet-compiler.wmflabs.org/compiler1002/22348/" [puppet] - 10https://gerrit.wikimedia.org/r/594475 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [14:55:57] (03PS3) 10Dzahn: contint: switch jenkins/zuul/gearman to contint2001 [puppet] - 10https://gerrit.wikimedia.org/r/594477 (https://phabricator.wikimedia.org/T224591) [14:56:57] 10Operations, 10MediaWiki-extensions-CodeReview: Set up static-codereview.wikimedia.org to host static HTML dump of CodeReview - https://phabricator.wikimedia.org/T243056 (10Dzahn) @Legoktm Here you go: https://static-codereview.wikimedia.org/ [14:57:02] legoktm: ^ [14:57:31] mutante: The urls are broken [14:57:31] https://static-codereview.wikimedia.org/MediaWiki/r1.html [14:57:34] 404 [14:57:46] Reedy: that's known. still waiting for the content [14:57:49] oh [14:57:50] haha [14:57:55] Fair enough [14:58:03] I was presuming that was like "it's here, working!" [14:58:15] i guess it will be a one-time manual upload and not git pulling [14:58:20] because it's 4G and will never change [14:58:47] I'd presume so [14:59:01] Reedy: yea, it is, for everything else besides those files, like the certificate, ATS config, puppet set up the site etc :) [14:59:19] even tests :p [14:59:38] (03CR) 10Ppchelko: [C: 03+2] "let's try it out!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/594696 (owner: 10Hnowlan) [14:59:55] (03Merged) 10jenkins-bot: changeprop: produce purge events to topic [deployment-charts] - 10https://gerrit.wikimedia.org/r/594696 (owner: 10Hnowlan) [15:00:33] (03PS3) 10CRusnov: netbox: Use netbox-extras version of tools [puppet] - 10https://gerrit.wikimedia.org/r/594602 [15:00:52] (03CR) 10CRusnov: "thanks :)" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/594602 (owner: 10CRusnov) [15:01:37] (03Abandoned) 10Jbond: ferm: add a very basic status check [puppet] - 10https://gerrit.wikimedia.org/r/573335 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [15:01:49] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Add .gitreview file [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/594717 (owner: 10Muehlenhoff) [15:06:36] (03CR) 10Vgutierrez: "expected NOOP from pcc: https://puppet-compiler.wmflabs.org/compiler1003/22349/" [puppet] - 10https://gerrit.wikimedia.org/r/594725 (owner: 10Vgutierrez) [15:07:09] (03CR) 10Jbond: "ready for review. 
ferm-status runs cleanly on all servers now" [puppet] - 10https://gerrit.wikimedia.org/r/576102 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [15:08:46] (03CR) 10Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/22350/" [puppet] - 10https://gerrit.wikimedia.org/r/594687 (owner: 10Vgutierrez) [15:09:25] (03CR) 10Alexandros Kosiaris: [C: 04-1] Citoid: Update restbase to 2.7.7 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/594513 (https://phabricator.wikimedia.org/T239459) (owner: 10Mvolz) [15:10:36] (03PS3) 10Mvolz: Citoid: Update service-runner to 2.7.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/594513 (https://phabricator.wikimedia.org/T239459) [15:15:54] (03PS2) 10Muehlenhoff: Add debian/ directory to the build overlay (WIP) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/594718 (https://phabricator.wikimedia.org/T233947) [15:18:19] 10Operations, 10Traffic, 10Patch-For-Review: cache_upload varnish-fe exhausting transient memory - https://phabricator.wikimedia.org/T249809 (10ema) >>! In T249809#6107775, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.org/log/KM7y43EBLkHzneNNgUu... [15:21:46] (03CR) 10Ema: [C: 03+1] ATS: Skip absent log definitions on logging.yaml [puppet] - 10https://gerrit.wikimedia.org/r/594725 (owner: 10Vgutierrez) [15:22:09] (03CR) 10Ema: [C: 03+1] ATS: Remove tls log and the associated log format [puppet] - 10https://gerrit.wikimedia.org/r/594687 (owner: 10Vgutierrez) [15:22:26] (03CR) 10Vgutierrez: [C: 03+2] ATS: Skip absent log definitions on logging.yaml [puppet] - 10https://gerrit.wikimedia.org/r/594725 (owner: 10Vgutierrez) [15:22:32] 10Operations, 10MediaWiki-extensions-CodeReview: Set up static-codereview.wikimedia.org to host static HTML dump of CodeReview - https://phabricator.wikimedia.org/T243056 (10MaxSem) >>! 
In T243056#6112939, @Dzahn wrote: > @Legoktm Here you go: https://static-codereview.wikimedia.org/ https://static-coderevi... [15:22:54] 10Operations, 10Commons, 10MediaWiki-File-management, 10Thumbor, 10Traffic: Ghostscript outputs errors to stdout despite -q, preventing Thumbor from generating some thumbnails properly - https://phabricator.wikimedia.org/T236240 (10Gilles) [15:25:44] (03CR) 10Vgutierrez: [C: 03+2] ATS: Remove tls log and the associated log format [puppet] - 10https://gerrit.wikimedia.org/r/594687 (owner: 10Vgutierrez) [15:25:55] 10Operations, 10CAS-SSO: icinga Blocked by X-Frame-Options Policy - https://phabricator.wikimedia.org/T251513 (10colewhite) p:05Triage→03Medium [15:26:24] 10Operations, 10serviceops: Update php-xdebug to 2.9.2 in apt.wm.o component/php72 - https://phabricator.wikimedia.org/T244716 (10jgleeson) Hi folks, could we get this added to our repo? [15:26:27] 10Operations, 10Analytics: Analytics1060 unresponsive - https://phabricator.wikimedia.org/T251973 (10colewhite) p:05Triage→03Medium [15:26:50] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Add Tobias Andersson to the ldap/wmde group - https://phabricator.wikimedia.org/T251997 (10colewhite) p:05Triage→03Medium a:03colewhite [15:27:15] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . 
[15:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:45] (03CR) 10Filippo Giunchedi: [C: 03+1] "Discussed on IRC, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/588759 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [15:30:26] 10Operations, 10Analytics: Analytics1060 unresponsive - https://phabricator.wikimedia.org/T251973 (10elukey) 05Open→03Resolved a:03elukey [15:31:07] 10Operations: debian-installer: partman doesn't allow lvm LVs to be reused when reimaging - https://phabricator.wikimedia.org/T252027 (10Kormat) [15:34:31] (03CR) 10Alexandros Kosiaris: ferm: Add status check (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/576101 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [15:35:05] (03CR) 10Alexandros Kosiaris: [C: 03+1] ferm: enable ferm status script [puppet] - 10https://gerrit.wikimedia.org/r/576102 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [15:35:16] (03Abandoned) 10Thcipriani: blubberoid: bump helm chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/524813 (owner: 10Thcipriani) [15:39:00] (03Abandoned) 10Thcipriani: Use python2 as basepython [software] - 10https://gerrit.wikimedia.org/r/484806 (owner: 10Thcipriani) [15:41:31] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . 
[15:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:41] (03CR) 10Alexandros Kosiaris: [C: 03+1] Citoid: Update service-runner to 2.7.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/594513 (https://phabricator.wikimedia.org/T239459) (owner: 10Mvolz) [15:42:06] (03PS4) 10Mvolz: Citoid: Update service-runner to 2.7.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/594513 (https://phabricator.wikimedia.org/T239459) [15:55:00] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [15:56:45] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: TBD) rack/setup/install backup1002 + array - https://phabricator.wikimedia.org/T250816 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` ['backup1002.eqiad.wmnet'] ` The log can be found in `/var/... [15:57:35] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [16:05:33] 10Operations, 10DNS, 10Traffic, 10Wikimedia-Blog: Availability to update DNS records for blog.wikimedia.org - https://phabricator.wikimedia.org/T251931 (10Varnent) VIP is available between 9 AM and 8 PM UTC. It will require updating the CNAME record: ` blog.wikimedia.org CNAME blog-wikimedia-org.go-vip.n... [16:10:01] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T251122 (10ahemmer) Hi @colewhite @Nuria You guys are right, I need access to datasets for analysis rather than the raw data. 
[16:11:06] 10Operations, 10MediaWiki-extensions-CodeReview: Set up static-codereview.wikimedia.org to host static HTML dump of CodeReview - https://phabricator.wikimedia.org/T243056 (10Dzahn) Yea. My message should be read as "the site has been created, the cert updated, puppet created the config, now i just need to dump... [16:15:45] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is OK: (C)100 gt (W)80 gt 76.27 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [16:20:08] (03PS1) 10Jcrespo: backup1002: Update NIC address for card with link [puppet] - 10https://gerrit.wikimedia.org/r/594746 (https://phabricator.wikimedia.org/T250816) [16:20:39] (03PS2) 10Jcrespo: backup1002: Update NIC address for card with link [puppet] - 10https://gerrit.wikimedia.org/r/594746 (https://phabricator.wikimedia.org/T250816) [16:21:23] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [16:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:07] (03CR) 10Jcrespo: [C: 03+2] backup1002: Update NIC address for card with link [puppet] - 10https://gerrit.wikimedia.org/r/594746 (https://phabricator.wikimedia.org/T250816) (owner: 10Jcrespo) [16:35:57] PROBLEM - Check systemd state on ms-be1032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:39:28] 10Operations, 10DNS, 10Traffic, 10Wikimedia-Blog: Availability to update DNS records for blog.wikimedia.org - https://phabricator.wikimedia.org/T251931 (10colewhite) There is ample coverage for that date and time range. Should we also schedule it for a time you will be available? 
I can handle the request... [16:40:08] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install backup1002 + array - https://phabricator.wikimedia.org/T250816 (10jcrespo) @Cmjohnson I can POST the server and get to BIOS without issue. However, I did the above change^ because the server didn't boot into... [16:40:46] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install backup1002 + array - https://phabricator.wikimedia.org/T250816 (10jcrespo) a:05jcrespo→03Cmjohnson [16:43:07] (03CR) 10Cwhite: [C: 03+2] admin: add pdas to ldap only users [puppet] - 10https://gerrit.wikimedia.org/r/594598 (https://phabricator.wikimedia.org/T251516) (owner: 10Cwhite) [16:46:48] 10Operations, 10DNS, 10Traffic, 10Wikimedia-Blog: Availability to update DNS records for blog.wikimedia.org - https://phabricator.wikimedia.org/T251931 (10Varnent) Between 15:00 and 20:00 UTC works for me - I will verify with VIP. [16:48:23] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP Access for Superset for PDas - https://phabricator.wikimedia.org/T251516 (10colewhite) 05Open→03Resolved Thanks @MoritzMuehlenhoff! `wmf` group membership provisioned. Please feel free to reopen if you encounter any related issue. 
[16:49:02] 10Operations: debian-installer: partman doesn't allow lvm LVs to be reused when reimaging - https://phabricator.wikimedia.org/T252027 (10colewhite) p:05Triage→03Medium [16:56:20] (03PS1) 10Papaul: DNS: Add mgmt and production DNS for thanos-fe200[1-3] [dns] - 10https://gerrit.wikimedia.org/r/594754 [16:56:47] (03CR) 10jerkins-bot: [V: 04-1] DNS: Add mgmt and production DNS for thanos-fe200[1-3] [dns] - 10https://gerrit.wikimedia.org/r/594754 (owner: 10Papaul) [16:59:37] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T251122 (10Nuria) @colewhite let's remove ssh access for @ahemmer as I do not think is needed , he just needs access to turnilo and sup... [16:59:38] (03PS1) 10Hnowlan: changeprop: repackage [deployment-charts] - 10https://gerrit.wikimedia.org/r/594756 [17:00:03] (03PS2) 10Papaul: DNS: Add mgmt and production DNS for thanos-fe200[1-3] [dns] - 10https://gerrit.wikimedia.org/r/594754 [17:00:52] (03PS1) 10Cwhite: admin: add jgiannelos to ldap only users [puppet] - 10https://gerrit.wikimedia.org/r/594757 (https://phabricator.wikimedia.org/T251899) [17:01:18] (03CR) 10Nuria: "Let's undo this change as i think he only needs LDAP access. Super thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/594593 (https://phabricator.wikimedia.org/T251122) (owner: 10Cwhite) [17:02:14] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/594697 (owner: 10Jbond) [17:02:38] @brennan fix for T252043 is ready [17:02:39] T252043: Special:DeletedContributions: $row does not contain fields needed for comment rev_comment - https://phabricator.wikimedia.org/T252043 [17:02:53] ^@brennen [17:02:59] (03PS2) 10Hnowlan: changeprop: repackage [deployment-charts] - 10https://gerrit.wikimedia.org/r/594756 [17:03:22] (03CR) 10Hnowlan: [V: 03+2 C: 03+2] changeprop: repackage [deployment-charts] - 10https://gerrit.wikimedia.org/r/594756 (owner: 10Hnowlan) [17:03:44] (03Merged) 10jenkins-bot: changeprop: repackage [deployment-charts] - 10https://gerrit.wikimedia.org/r/594756 (owner: 10Hnowlan) [17:04:12] (03CR) 10Papaul: [C: 03+2] DNS: Add mgmt and production DNS for thanos-fe200[1-3] [dns] - 10https://gerrit.wikimedia.org/r/594754 (owner: 10Papaul) [17:05:22] 10Operations, 10Android-app-Bugs, 10Page Content Service, 10Product-Infrastructure-Team-Backlog, and 4 others: Incorrect language variant returned for PCS endpoints - https://phabricator.wikimedia.org/T249284 (10JoeWalsh) >>! In T249284#6034265, @Pchelolo wrote: > RESTBase stack only supports requests with... [17:05:34] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [17:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:12] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . 
[17:06:12] (03PS3) 10Cwhite: admin: add ahemmer to ldap only users [puppet] - 10https://gerrit.wikimedia.org/r/594593 (https://phabricator.wikimedia.org/T251122) [17:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:54] (03CR) 10Volans: "No strong opinion, agree that is not great to have defaults on positional params and might be confusing that you can't override the second" [puppet] - 10https://gerrit.wikimedia.org/r/594724 (owner: 10Jbond) [17:07:48] (03PS4) 10Cwhite: admin: add ahemmer to ldap only users [puppet] - 10https://gerrit.wikimedia.org/r/594593 (https://phabricator.wikimedia.org/T251123) [17:08:41] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TDB) rack/setup/install thanos-fe200[123] - https://phabricator.wikimedia.org/T251635 (10Papaul) [17:09:04] (03CR) 10Nuria: [C: 03+1] admin: add ahemmer to ldap only users [puppet] - 10https://gerrit.wikimedia.org/r/594593 (https://phabricator.wikimedia.org/T251123) (owner: 10Cwhite) [17:09:40] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T251122 (10colewhite) 05Open→03Declined Per discussion, closing this task and moving to provision T251123. [17:10:04] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/594602 (owner: 10CRusnov) [17:11:16] 10Operations, 10serviceops: Reduce read pressure on memcached servers by adding a machine-local Memcache instance - https://phabricator.wikimedia.org/T244340 (10RLazarus) a:03RLazarus [17:11:39] (03CR) 10Mforns: [C: 03+1] "I think it's simple and immediate, and we always can improve it in the future. Feels like a good option to me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/594565 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [17:11:51] (03CR) 10Cwhite: [C: 03+2] admin: add ahemmer to ldap only users [puppet] - 10https://gerrit.wikimedia.org/r/594593 (https://phabricator.wikimedia.org/T251123) (owner: 10Cwhite) [17:12:52] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [17:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:08] 10Operations, 10Analytics, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for Antonino Hemmer (superset, turnilo, hue) - https://phabricator.wikimedia.org/T251123 (10colewhite) 05Open→03Resolved ah212 added to `wmf` ldap group. Please feel free to reopen if you encounter any... [17:15:11] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T251122 (10colewhite) [17:17:45] 10Operations, 10DNS, 10Traffic, 10Wikimedia-Blog: Availability to update DNS records for blog.wikimedia.org - https://phabricator.wikimedia.org/T251931 (10colewhite) a:03colewhite [17:19:36] (03PS1) 10RLazarus: mcrouter_wancache: Add mcrouter support for a machine-local memcached instance [puppet] - 10https://gerrit.wikimedia.org/r/594760 [17:19:50] (03CR) 10jerkins-bot: [V: 04-1] mcrouter_wancache: Add mcrouter support for a machine-local memcached instance [puppet] - 10https://gerrit.wikimedia.org/r/594760 (owner: 10RLazarus) [17:21:14] (03PS2) 10RLazarus: mcrouter_wancache: Add mcrouter support for a machine-local memcached instance [puppet] - 10https://gerrit.wikimedia.org/r/594760 [17:22:16] (03CR) 10jerkins-bot: [V: 04-1] mcrouter_wancache: Add mcrouter support for a machine-local memcached instance [puppet] - 10https://gerrit.wikimedia.org/r/594760 (owner: 10RLazarus) [17:23:26] (03PS3) 10RLazarus: mcrouter_wancache: Add mcrouter 
support for a machine-local memcached instance [puppet] - 10https://gerrit.wikimedia.org/r/594760 [17:24:26] (03CR) 10jerkins-bot: [V: 04-1] mcrouter_wancache: Add mcrouter support for a machine-local memcached instance [puppet] - 10https://gerrit.wikimedia.org/r/594760 (owner: 10RLazarus) [17:27:05] (03PS1) 10Hnowlan: changeprop: repackage changeprop [deployment-charts] - 10https://gerrit.wikimedia.org/r/594762 [17:27:19] (03PS4) 10RLazarus: mcrouter_wancache: Add mcrouter support for a machine-local memcached instance [puppet] - 10https://gerrit.wikimedia.org/r/594760 [17:27:37] (03PS1) 10Cwhite: update blog.wm.o to its new home [dns] - 10https://gerrit.wikimedia.org/r/594763 (https://phabricator.wikimedia.org/T251931) [17:28:11] (03CR) 10Hnowlan: [C: 03+2] changeprop: repackage changeprop [deployment-charts] - 10https://gerrit.wikimedia.org/r/594762 (owner: 10Hnowlan) [17:28:40] (03Merged) 10jenkins-bot: changeprop: repackage changeprop [deployment-charts] - 10https://gerrit.wikimedia.org/r/594762 (owner: 10Hnowlan) [17:30:47] RECOVERY - Check systemd state on ms-be1032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:31:35] 10Operations, 10Android-app-Bugs, 10Page Content Service, 10Product-Infrastructure-Team-Backlog, and 4 others: Incorrect language variant returned for PCS endpoints - https://phabricator.wikimedia.org/T249284 (10Pchelolo) > @Pchelolo thanks for updating this documentation - are there specifics about which... [17:31:44] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [17:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:36] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . 
[17:34:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:13] (03PS1) 10Volans: netbox: adapt to new Netbox API [software/spicerack] - 10https://gerrit.wikimedia.org/r/594764 [17:48:43] (03PS5) 10RLazarus: mcrouter_wancache: Add mcrouter support for a machine-local memcached instance [puppet] - 10https://gerrit.wikimedia.org/r/594760 (https://phabricator.wikimedia.org/T244340) [17:50:07] (03PS1) 10Volans: netbox: adapt to new Netbox API [software/homer] - 10https://gerrit.wikimedia.org/r/594765 [17:50:14] (03CR) 10He7d3r: Explicitly install both myspell-pt-pt and myspell-pt-br (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/586456 (https://phabricator.wikimedia.org/T249559) (owner: 10Halfak) [17:50:44] jouncebot: now [17:50:44] No deployments scheduled for the next 0 hour(s) and 9 minute(s) [17:50:47] jouncebot: next [17:50:47] In 0 hour(s) and 9 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200506T1800) [17:50:47] In 0 hour(s) and 9 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200506T1800) [17:52:59] (03CR) 10RLazarus: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/594760 (https://phabricator.wikimedia.org/T244340) (owner: 10RLazarus) [17:54:17] (03CR) 10RLazarus: "Confirmed noop with the hiera flag off: https://puppet-compiler.wmflabs.org/compiler1001/22351/" [puppet] - 10https://gerrit.wikimedia.org/r/594760 (https://phabricator.wikimedia.org/T244340) (owner: 10RLazarus) [17:54:23] DannyS712: thanks - i'm having some serious connectivity issues at the moment and stuck on a laptop without creds, may need to get someone to sling that out. 
[17:54:36] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Discovery-Search (Current work): SRE Onboarding - Ryan Kemper, Search Platform team - https://phabricator.wikimedia.org/T251572 (10herron) [17:56:58] twentyafterfour: if you could have a look at patch for T252043, appreciated. [17:56:59] T252043: Special:DeletedContributions: $row does not contain fields needed for comment rev_comment - https://phabricator.wikimedia.org/T252043 [17:57:01] @brennen no problem; I broke it, I fix it :) [17:58:03] (03CR) 10CRusnov: [C: 03+1] "LGTM, a surprising amount of overhead went into that choice thing :\" [software/spicerack] - 10https://gerrit.wikimedia.org/r/594764 (owner: 10Volans) [17:58:59] brennen / DannyS712: am I clear to deploy right now or should it wait? [17:59:22] idk; deploy what? [17:59:50] the patch https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/594751/ [18:00:04] brennen and hashar: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Train log triage with CPT . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200506T1800). [18:00:04] RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Morning SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200506T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. 
[18:00:20] oh; that is for master, haven't backported it yet, but I think so; I tested it with patch demo and it works [18:01:11] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Discovery-Search (Current work): SRE Onboarding - Ryan Kemper, Search Platform team - https://phabricator.wikimedia.org/T251572 (10Dzahn) [18:01:14] (03CR) 10Volans: [C: 03+2] "Indeed, glad we're getting rid of it :)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/594764 (owner: 10Volans) [18:01:47] DannyS712: brennen asked me to deploy for him because of connectivity issues. since it's a UBN it gets priority [18:01:47] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Discovery-Search (Current work): SRE Onboarding - Ryan Kemper, Search Platform team - https://phabricator.wikimedia.org/T251572 (10Dzahn) added to Google group maint-announce and Google calendar Ops Vendor Maintenance. [18:02:02] (03CR) 10CRusnov: [C: 03+1] "Looks good" [software/homer] - 10https://gerrit.wikimedia.org/r/594765 (owner: 10Volans) [18:02:20] okay; should I backport it now or after it merges to master? 
[18:02:33] DannyS712: go ahead with backport [18:02:59] (03CR) 10Volans: [C: 03+2] netbox: adapt to new Netbox API [software/homer] - 10https://gerrit.wikimedia.org/r/594765 (owner: 10Volans) [18:03:05] (03PS4) 10CRusnov: netbox: Use netbox-extras version of tools [puppet] - 10https://gerrit.wikimedia.org/r/594602 [18:03:11] backport at https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/594768/ [18:03:23] DannyS712: I just gave the patch on master a +2 [18:06:20] (03Merged) 10jenkins-bot: netbox: adapt to new Netbox API [software/homer] - 10https://gerrit.wikimedia.org/r/594765 (owner: 10Volans) [18:07:46] (03Merged) 10jenkins-bot: netbox: adapt to new Netbox API [software/spicerack] - 10https://gerrit.wikimedia.org/r/594764 (owner: 10Volans) [18:08:41] (03CR) 10Ottomata: Initial debian commit (035 comments) [debs/anaconda] (debian) - 10https://gerrit.wikimedia.org/r/594204 (https://phabricator.wikimedia.org/T251006) (owner: 10Ottomata) [18:09:18] (03CR) 10Giuseppe Lavagetto: mcrouter_wancache: Add mcrouter support for a machine-local memcached instance (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/594760 (https://phabricator.wikimedia.org/T244340) (owner: 10RLazarus) [18:12:28] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.0.34 [software/spicerack] - 10https://gerrit.wikimedia.org/r/594769 [18:13:27] (03CR) 10CRusnov: [C: 03+2] netbox: Use netbox-extras version of tools [puppet] - 10https://gerrit.wikimedia.org/r/594602 (owner: 10CRusnov) [18:14:02] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.2.2 [software/homer] - 10https://gerrit.wikimedia.org/r/594770 [18:14:26] (03PS1) 10Ryan Kemper: Add 'Ryan Kemper' [puppet] - 10https://gerrit.wikimedia.org/r/594771 [18:14:28] (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! 
:) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - 10https://gerrit.wikimedia.org/r/594771 (owner: 10Ryan Kemper) [18:14:58] (03CR) 10jerkins-bot: [V: 04-1] Add 'Ryan Kemper' [puppet] - 10https://gerrit.wikimedia.org/r/594771 (owner: 10Ryan Kemper) [18:16:46] (03PS1) 10Mforns: web::fetches::analytics::job: do not rsync mediawiki if missing source [puppet] - 10https://gerrit.wikimedia.org/r/594773 (https://phabricator.wikimedia.org/T251858) [18:19:59] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={atlas_exporter,pdu_sentry4} site={codfw,eqsin} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:20:49] (03PS2) 10Ryan Kemper: Add 'Ryan Kemper' [puppet] - 10https://gerrit.wikimedia.org/r/594771 [18:21:45] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:26:13] DannyS712: can you test on mwdebug1001? [18:26:39] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.0.34 [software/spicerack] - 10https://gerrit.wikimedia.org/r/594769 (owner: 10Volans) [18:27:33] @twentyafterfour confirmed to work [18:27:46] DannyS712: thank you very much! deploying it [18:28:57] !log twentyafterfour@deploy1001 Synchronized php-1.35.0-wmf.31/includes/specials/pagers/DeletedContribsPager.php: sync https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/594768/ fixes T252043 (duration: 01m 08s) [18:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:02] T252043: Special:DeletedContributions: $row does not contain fields needed for comment rev_comment - https://phabricator.wikimedia.org/T252043 [18:29:27] https://www.mediawiki.org/wiki/Special:DeletedContributions/DannyS712 now works [18:29:45] woot! 
[18:30:28] may have another issue - https://phabricator.wikimedia.org/T252052 [18:32:38] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.2.2 [software/homer] - 10https://gerrit.wikimedia.org/r/594770 (owner: 10Volans) [18:32:39] DannyS712: we're doing a log triage meeting rn, will raise [18:32:53] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.34 [software/spicerack] - 10https://gerrit.wikimedia.org/r/594769 (owner: 10Volans) [18:32:57] (03CR) 10Ayounsi: [C: 03+1] "Tested." [software/homer] - 10https://gerrit.wikimedia.org/r/594765 (owner: 10Volans) [18:32:59] DannyS712: indeed that error is showing up in logstash a lot [18:34:43] (03CR) 10Herron: "the change itself LGTM, but please see a few nitpicks about the commit message" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/594771 (owner: 10Ryan Kemper) [18:35:03] (03PS1) 10Volans: Upstream release v0.0.34 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/594777 [18:35:48] (03CR) 10Volans: [C: 03+2] Upstream release v0.0.34 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/594777 (owner: 10Volans) [18:35:54] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.2.2 [software/homer] - 10https://gerrit.wikimedia.org/r/594770 (owner: 10Volans) [18:41:33] (03PS1) 10Volans: Upstream release v0.2.2 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/594779 [18:41:34] (03Merged) 10jenkins-bot: Upstream release v0.0.34 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/594777 (owner: 10Volans) [18:42:27] (03CR) 10Volans: [V: 03+2 C: 03+2] Upstream release v0.2.2 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/594779 (owner: 10Volans) [18:43:59] !log volans@deploy1001 Started deploy [homer/deploy@8224f0a]: Release v0.2.2 [18:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:18] !log volans@deploy1001 Finished deploy [homer/deploy@8224f0a]: Release v0.2.2 
(duration: 00m 18s) [18:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:05] !log uploaded spicerack_0.0.34-1_amd64.deb to apt.wikimedia.org stretch-wikimedia [18:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:13] twentyafterfour: going offline for a sec to fight with network, hopefully back shortly [18:48:02] 10Operations, 10ops-eqiad: Degraded RAID on kafka-jumbo1001 - https://phabricator.wikimedia.org/T251586 (10Cmjohnson) @elukey I need to power this server off, I am not able to reach the mgmt/idrac for the server and need access to pull the required Dell report. [18:54:04] !log upgraded spicerack to spicerack_0.0.34-1_amd64.deb on cumin[12]001 [18:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:52] (03CR) 10Bstorm: [C: 03+2] d/changelog: prepare for 0.69 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/594579 (owner: 10Bstorm) [18:55:02] (and back) [18:55:33] brennen: the second patch just merged [18:55:36] I'm gonna deploy it now [18:55:43] ta [18:55:45] (03Merged) 10jenkins-bot: d/changelog: prepare for 0.69 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/594579 (owner: 10Bstorm) [18:55:57] can someone +2 the master patch? the backport just merged [18:57:31] +2'd [18:58:01] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Add Tobias Andersson to the ldap/wmde group - https://phabricator.wikimedia.org/T251997 (10RStallman-legalteam) @toan - Greetings! I will send the NDA to your WMDE email address via docusign. Thanks. 
[18:58:21] !log twentyafterfour@deploy1001 Synchronized php-1.35.0-wmf.31/includes/specials/pagers/DeletedContribsPager.php: deploy https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/594778/ fixes UBN T252052 (duration: 01m 09s) [18:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:24] T252052: Fix Undefined property: stdClass::$ar_id from Revision/RevisionArchiveRecord.php line 74 - https://phabricator.wikimedia.org/T252052 [19:00:04] brennen and hashar: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Mediawiki train - American+European Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200506T1900). [19:00:31] brennen: how's your connectivity now? [19:00:49] train looks to be unblocked [19:01:31] twentyafterfour: i'm back, lagging a little but that's probably just everybody in town hitting refresh on everything all at once - should be good to roll train. [19:02:43] brennen: 👍 [19:03:05] !log 1.35.0-wmf.31 train unblocked (T249963), rolling forward to group0 [19:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:11] T249963: 1.35.0-wmf.31 deployment blockers - https://phabricator.wikimedia.org/T249963 [19:03:34] 🚆chooch [19:03:44] !log CORRECTION: 1.35.0-wmf.31 train unblocked (T249963), rolling forward to group1 [19:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:24] (03PS1) 10Brennen Bearnes: group1 wikis to 1.35.0-wmf.31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594784 [19:04:26] (03CR) 10Brennen Bearnes: [C: 03+2] group1 wikis to 1.35.0-wmf.31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594784 (owner: 10Brennen Bearnes) [19:05:09] (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594784 (owner: 10Brennen Bearnes) [19:07:44] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 
1.35.0-wmf.31 [19:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:53] !log brennen@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.31 (duration: 01m 08s) [19:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:23] /w/api.php ErrorException from line 393 of /srv/mediawiki/php-1.35.0-wmf.31/includes/debug/MWDebug.php: PHP Warning: [data-update-failed]: A data update callback triggered an exception (Fail-safe exception. Avoiding infinite loop due to possibily undetectable existing records in master. [19:10:29] nice message ;) [19:11:34] hashar: i think that one is pre-existing... ? checking... [19:13:04] brennen@mwlog1001:/srv/mw-log$ grep 'data-update-failed' ./error.log | grep wmf.31 -c [19:13:06] 6 [19:13:08] brennen@mwlog1001:/srv/mw-log$ grep 'data-update-failed' ./error.log | grep wmf.30 -c [19:13:10] 212 [19:13:17] no match in phabricator for "Avoiding infinite loop" [19:13:30] but there is a closed task [19:13:39] https://phabricator.wikimedia.org/T247553 [19:13:53] but that is from March [19:16:34] hashar: reopen or new task, you think? [19:16:58] I would guess a new task [19:17:01] and ping the old task [19:17:08] filing. [19:17:12] the traces are slightly different [19:17:17] might be different deferred update jobs [19:18:01] it is not like I have any clue what that message can represent anyway [19:19:41] looks like there is a lock race of some sort [19:22:11] brennen: looks mostly fine beside that [19:22:17] I am off to take care a bit of kids [19:22:19] bbl [19:22:32] ack, thanks. 
[19:33:43] brennen: that is where it would be helpful to do the deployment over a conf call/ with a group of knowledgeable folks ;D [19:33:58] then there are so many things that could exception out that it would probably mean a lot of people [19:35:07] (03CR) 10Cwhite: [C: 03+2] smart: prepare collect_smart_metrics for handling devices of different types [puppet] - 10https://gerrit.wikimedia.org/r/588759 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [19:37:25] hashar: yeah, i like the idea of people attending deploys generally. [19:37:36] (03CR) 10Cwhite: [C: 03+2] mtail: add flag to install mtail from apt component [puppet] - 10https://gerrit.wikimedia.org/r/593327 (https://phabricator.wikimedia.org/T251466) (owner: 10Cwhite) [19:37:38] i think the train log triage meeting is a step in that direction. [19:37:52] (03PS9) 10Cwhite: mtail: add flag to install mtail from apt component [puppet] - 10https://gerrit.wikimedia.org/r/593327 (https://phabricator.wikimedia.org/T251466) [19:38:50] 10Operations, 10DC-Ops, 10SRE-Access-Requests: access request on cumin[1-2]001 for John Clark - https://phabricator.wikimedia.org/T249916 (10Jclark-ctr) @MoritzMuehlenhoff So i am unable to ssh to cumin1001 ` ssh: Could not resolve hostname cumin1001.eqiad.wmnet: nodename nor servname provided, or not kn... [19:42:16] 10Operations, 10DC-Ops, 10SRE-Access-Requests: access request on cumin[1-2]001 for John Clark - https://phabricator.wikimedia.org/T249916 (10CDanis) @Jclark-ctr You should edit your ~/.ssh/config file as detailed at https://wikitech.wikimedia.org/wiki/Production_access#Setting_up_your_SSH_config and then it... 
[19:46:31] (03CR) 10Ottomata: web::fetches::analytics::job: do not rsync mediawiki if missing source (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/594773 (https://phabricator.wikimedia.org/T251858) (owner: 10Mforns) [19:51:27] (03PS6) 10Ottomata: Initial debian commit [debs/anaconda] (debian) - 10https://gerrit.wikimedia.org/r/594204 (https://phabricator.wikimedia.org/T251006) [19:53:05] (03PS2) 10Mforns: web::fetches::analytics::job: do not rsync mediawiki if missing source [puppet] - 10https://gerrit.wikimedia.org/r/594773 (https://phabricator.wikimedia.org/T251858) [19:54:25] (03CR) 10Mforns: "Thanks for the review!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/594773 (https://phabricator.wikimedia.org/T251858) (owner: 10Mforns) [20:00:04] halfak and accraze: Dear deployers, time to do the Services – Graphoid / Citoid / ORES deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200506T2000). [20:12:43] (03CR) 10QChris: [C: 04-1] "Thanks for the pointer to this commit. 
I see that the change is from" [software/gerrit] (deploy/wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/574092 (https://phabricator.wikimedia.org/T200739) (owner: 10Thcipriani) [20:13:29] (03CR) 10Ottomata: Initial debian commit (031 comment) [debs/anaconda] (debian) - 10https://gerrit.wikimedia.org/r/594204 (https://phabricator.wikimedia.org/T251006) (owner: 10Ottomata) [20:19:42] (03PS1) 10Andrew Bogott: openldap acls: change the type of extra_acls [puppet] - 10https://gerrit.wikimedia.org/r/594792 (https://phabricator.wikimedia.org/T252066) [20:20:45] (03CR) 10jerkins-bot: [V: 04-1] openldap acls: change the type of extra_acls [puppet] - 10https://gerrit.wikimedia.org/r/594792 (https://phabricator.wikimedia.org/T252066) (owner: 10Andrew Bogott) [20:22:26] (03PS2) 10Andrew Bogott: openldap acls: change the type of extra_acls [puppet] - 10https://gerrit.wikimedia.org/r/594792 (https://phabricator.wikimedia.org/T252066) [20:23:31] (03CR) 10jerkins-bot: [V: 04-1] openldap acls: change the type of extra_acls [puppet] - 10https://gerrit.wikimedia.org/r/594792 (https://phabricator.wikimedia.org/T252066) (owner: 10Andrew Bogott) [20:24:08] (03PS3) 10Andrew Bogott: openldap acls: change the type of extra_acls [puppet] - 10https://gerrit.wikimedia.org/r/594792 (https://phabricator.wikimedia.org/T252066) [20:25:16] 10Operations, 10DNS, 10Traffic, 10Wikimedia-Blog, 10Patch-For-Review: Update DNS records for blog.wikimedia.org - https://phabricator.wikimedia.org/T251931 (10colewhite) [20:26:26] (03PS4) 10Andrew Bogott: openldap acls: change the type of extra_acls [puppet] - 10https://gerrit.wikimedia.org/r/594792 (https://phabricator.wikimedia.org/T252066) [20:28:19] (03PS5) 10Andrew Bogott: openldap acls: change the type of extra_acls [puppet] - 10https://gerrit.wikimedia.org/r/594792 (https://phabricator.wikimedia.org/T252066) [20:34:22] (03PS6) 10Andrew Bogott: openldap acls: change the type of extra_acls [puppet] - 10https://gerrit.wikimedia.org/r/594792 
(https://phabricator.wikimedia.org/T252066) [20:35:22] (03CR) 10jerkins-bot: [V: 04-1] openldap acls: change the type of extra_acls [puppet] - 10https://gerrit.wikimedia.org/r/594792 (https://phabricator.wikimedia.org/T252066) (owner: 10Andrew Bogott) [20:39:52] (03PS1) 10RLazarus: pcc: Parse `git log` correctly for HEAD/last/latest change number. [puppet] - 10https://gerrit.wikimedia.org/r/594797 [20:43:28] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): WQDS Data Reload - https://phabricator.wikimedia.org/T252068 (10Gehel) [20:44:00] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): WQDS Data Reload - https://phabricator.wikimedia.org/T252068 (10RKemper) `wdqs1010` is one of our test servers Therefore can screw up `wdqs1010` as much as we want, but not the others --- we need to transfer that data... [20:44:09] PROBLEM - MegaRAID on analytics1055 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:44:10] 10Operations, 10DC-Ops: determine/process/document bios firmware tracking/updating policies - https://phabricator.wikimedia.org/T141128 (10wiki_willy) Chatted with @Volans yesterday for a little bit on best way we should approach doing firmware upgrades going forward. My preference is that service owners have... 
[20:44:16] ACKNOWLEDGEMENT - MegaRAID on analytics1055 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T252070 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:44:19] 10Operations, 10ops-eqiad: Degraded RAID on analytics1055 - https://phabricator.wikimedia.org/T252070 (10ops-monitoring-bot) [20:45:01] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): WQDS Data Reload - https://phabricator.wikimedia.org/T252068 (10RKemper) should also circle back and fix https://github.com/wikimedia/puppet/blob/production/manifests/site.pp#L2207-L2242,` wdqs200[78]` are declared twice [20:49:56] 10Operations: Migrate Cumin hosts to Buster - https://phabricator.wikimedia.org/T245114 (10Jdforrester-WMF) [20:51:09] 10Operations: Migrate Cumin hosts to Buster - https://phabricator.wikimedia.org/T245114 (10Volans) I'm working on the cumin release, I need to find some solution for some dependency incompatibility between code and the versions in Debian. I'll update the task with more information in the next few days. 
[20:57:44] (03PS1) 10Andrew Bogott: Openldap: split the clouddev ldap server into a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/594799 (https://phabricator.wikimedia.org/T252066) [20:58:50] (03CR) 10jerkins-bot: [V: 04-1] Openldap: split the clouddev ldap server into a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/594799 (https://phabricator.wikimedia.org/T252066) (owner: 10Andrew Bogott) [21:03:26] (03PS1) 10Herron: lists: add concept of primary and standby host, and rsync prmry -> stndby [puppet] - 10https://gerrit.wikimedia.org/r/594801 (https://phabricator.wikimedia.org/T250032) [21:04:26] (03CR) 10jerkins-bot: [V: 04-1] lists: add concept of primary and standby host, and rsync prmry -> stndby [puppet] - 10https://gerrit.wikimedia.org/r/594801 (https://phabricator.wikimedia.org/T250032) (owner: 10Herron) [21:04:53] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [21:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:02] (03PS2) 10Andrew Bogott: Openldap: split the clouddev ldap server into a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/594799 (https://phabricator.wikimedia.org/T252066) [21:05:02] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [21:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:06] (03CR) 10jerkins-bot: [V: 04-1] Openldap: split the clouddev ldap server into a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/594799 (https://phabricator.wikimedia.org/T252066) (owner: 10Andrew Bogott) [21:06:16] (03PS7) 10Andrew Bogott: openldap acls: change the type of extra_acls [puppet] - 10https://gerrit.wikimedia.org/r/594792 (https://phabricator.wikimedia.org/T252066) [21:06:18] (03PS3) 10Andrew Bogott: Openldap: split the clouddev ldap server into a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/594799 (https://phabricator.wikimedia.org/T252066) 
[21:07:27] (03CR) 10jerkins-bot: [V: 04-1] Openldap: split the clouddev ldap server into a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/594799 (https://phabricator.wikimedia.org/T252066) (owner: 10Andrew Bogott) [21:07:32] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [21:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:01] (03PS2) 10Herron: lists: add concept of primary and standby host, and rsync prmry -> stndby [puppet] - 10https://gerrit.wikimedia.org/r/594801 (https://phabricator.wikimedia.org/T250032) [21:09:02] (03CR) 10jerkins-bot: [V: 04-1] lists: add concept of primary and standby host, and rsync prmry -> stndby [puppet] - 10https://gerrit.wikimedia.org/r/594801 (https://phabricator.wikimedia.org/T250032) (owner: 10Herron) [21:09:38] (03PS4) 10Andrew Bogott: Openldap: split the clouddev ldap server into a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/594799 (https://phabricator.wikimedia.org/T252066) [21:13:26] (03PS8) 10Andrew Bogott: openldap acls: change the type of extra_acls [puppet] - 10https://gerrit.wikimedia.org/r/594792 (https://phabricator.wikimedia.org/T252066) [21:13:29] (03PS5) 10Andrew Bogott: Openldap: split the clouddev ldap server into a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/594799 (https://phabricator.wikimedia.org/T252066) [21:14:20] @brennen @twentyafterfour patch for the last UBN (T252072) is ready for review [21:14:21] T252072: Special:RevisionReview: PHP Unknown error: Object of class MediaWiki\Revision\RevisionStoreRecord could not be converted to string - https://phabricator.wikimedia.org/T252072 [21:15:36] (03PS3) 10Herron: lists: add concept of primary and standby host, and rsync prmry -> stndby [puppet] - 10https://gerrit.wikimedia.org/r/594801 (https://phabricator.wikimedia.org/T250032) [21:17:04] (03PS9) 10Andrew Bogott: openldap acls: change the type of extra_acls [puppet] - 
10https://gerrit.wikimedia.org/r/594792 (https://phabricator.wikimedia.org/T252066) [21:17:06] (03PS6) 10Andrew Bogott: Openldap: split the clouddev ldap server into a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/594799 (https://phabricator.wikimedia.org/T252066) [21:20:36] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1003/22366/" [puppet] - 10https://gerrit.wikimedia.org/r/594801 (https://phabricator.wikimedia.org/T250032) (owner: 10Herron) [21:22:31] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops: Degraded RAID on analytics1055 - https://phabricator.wikimedia.org/T252070 (10colewhite) p:05Triage→03Medium [21:22:34] DannyS712: thx, i defer to Pchelolo on the actual code change, though it looks simple enough [21:22:51] no problem; if you look at the patch that caused the issue the mistake was pretty clear [21:25:44] brennen: merged. apologies for missing this in the review [21:26:06] (03PS7) 10Andrew Bogott: Openldap: split the clouddev ldap server into a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/594799 (https://phabricator.wikimedia.org/T252066) [21:26:26] backport is at https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/594803/ [21:29:36] hrm, i also see 89 instances of another ConvertibleTimeStamp error - wondering if that's same or related... i'm relegated to laggy cellphone tethering again atm so triaging is taking longer than usual. [21:30:02] can you post a stack trace so I can see if I caused it / how to fix it? [21:30:23] yeah, one sec. [21:34:09] (03CR) 10RLazarus: [C: 03+1] pcc: add ability to parse commit messages for `Hosts:` lines (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/594708 (owner: 10Jbond) [21:37:17] Pchelolo: would you mind checking if https://logstash.wikimedia.org/goto/9cafb14071adb43d65f082199c196a6e is down to the same code as T252072? i'm having difficulty getting much of anything to load on either connection i've got available to me at the moment. 
[21:37:19] T252072: Special:RevisionReview: PHP Unknown error: Object of class MediaWiki\Revision\RevisionStoreRecord could not be converted to string - https://phabricator.wikimedia.org/T252072
[21:37:41] one sec
[21:38:32] (PS8) Andrew Bogott: Openldap: split the clouddev ldap server into a separate profile [puppet] - https://gerrit.wikimedia.org/r/594799 (https://phabricator.wikimedia.org/T252066)
[21:38:59] thanks
[21:39:30] brennen: ye, same thing
[21:40:59] (PS1) QChris: Add .gitreview [debs/logstash-filter-verifier] - https://gerrit.wikimedia.org/r/594806
[21:41:01] (CR) QChris: [V: +2 C: +2] Add .gitreview [debs/logstash-filter-verifier] - https://gerrit.wikimedia.org/r/594806 (owner: QChris)
[21:41:07] Pchelolo: thanks
[21:45:16] !log updating puppet compiler facts
[21:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:50:06] (PS9) Andrew Bogott: Openldap: split the clouddev ldap server into a separate profile [puppet] - https://gerrit.wikimedia.org/r/594799 (https://phabricator.wikimedia.org/T252066)
[21:53:16] (PS10) Andrew Bogott: Openldap: split the clouddev ldap server into a separate profile [puppet] - https://gerrit.wikimedia.org/r/594799 (https://phabricator.wikimedia.org/T252066)
[21:57:06] (CR) Andrew Bogott: [C: +2] openldap acls: change the type of extra_acls [puppet] - https://gerrit.wikimedia.org/r/594792 (https://phabricator.wikimedia.org/T252066) (owner: Andrew Bogott)
[21:57:18] (CR) Andrew Bogott: [C: +2] Openldap: split the clouddev ldap server into a separate profile [puppet] - https://gerrit.wikimedia.org/r/594799 (https://phabricator.wikimedia.org/T252066) (owner: Andrew Bogott)
[21:57:27] (PS11) Andrew Bogott: Openldap: split the clouddev ldap server into a separate profile [puppet] - https://gerrit.wikimedia.org/r/594799 (https://phabricator.wikimedia.org/T252066)
[22:00:02] DannyS712: want to confirm that on mwdebug1001?
[22:06:03] (^ or Pchelolo)
[22:14:27] Operations, Discovery-Search, SDC General, Structured Data Engineering, and 2 others: Create CQS puppet configs by applying query_service module - https://phabricator.wikimedia.org/T237089 (EBernhardson) Matt's initial work has gotten us most of the way there. In reviewing whats available now, an...
[22:15:32] sorry I missed this; testing
[22:17:35] @brennen confirmed to work on mwdebug1001
[22:18:15] DannyS712: cool, ty. deploying.
[22:22:41] !log brennen@deploy1001 Synchronized php-1.35.0-wmf.31/includes/revisionlist/RevisionItem.php: [[gerrit:594803|RevisionItem: Fix providing timestamp in getRevisionLink ]] (duration: 01m 09s)
[22:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:25:23] meanwhile: does T252077 warrant a rollback?
[22:25:23] T252077: WikimediaPrometheusQueryServiceLagProvider: PHP Warning: A non-numeric value encountered - https://phabricator.wikimedia.org/T252077
[22:26:17] https://github.com/wikimedia/puppet/blob/6b0dc71f153b6f052eb117c72ed365aaedc12a4d/modules/profile/manifests/mediawiki/maintenance/wikidata.pp#L73
[22:26:21] (it is a cron, yeah)
[22:27:03] * Reedy looks
[22:27:25] thx.
[22:28:55] addshore: You about still
[22:30:29] brennen: I'm presuming time() isn't broken in PHP...
[22:30:37] well one can hope.
[22:30:42] :)
[22:30:52] So I'm guessing it's bad data from the prometheus service
[22:30:57] No recent changes to the code
[22:31:20] yeah, makes sense. in that case i'll unblock.
[23:00:04] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Evening SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200506T2300).
[23:00:05] RoanKattouw and hmonroy: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:37] Ready. Not looking forward to getting a sticker.
[23:00:44] I'll do the SWAT
[23:00:48] And I'll try not to earn us a sticker
[23:00:51] kk, thanks!
[23:01:04] (CR) Catrope: [C: +2] Enable password-reset-update on Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/585371 (https://phabricator.wikimedia.org/T245791) (owner: Samwilson)
[23:04:25] (PS3) Catrope: Enable password-reset-update on Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/585371 (https://phabricator.wikimedia.org/T245791) (owner: Samwilson)
[23:05:25] (CR) Catrope: Enable password-reset-update on Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/585371 (https://phabricator.wikimedia.org/T245791) (owner: Samwilson)
[23:05:29] (CR) Catrope: [C: +2] Enable password-reset-update on Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/585371 (https://phabricator.wikimedia.org/T245791) (owner: Samwilson)
[23:06:18] (Merged) jenkins-bot: Enable password-reset-update on Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/585371 (https://phabricator.wikimedia.org/T245791) (owner: Samwilson)
[23:07:39] hmonroy: Your patch is on mwdebug1002, please test
[23:08:19] Awesome! Testing
[23:09:06] The change is there and everything else looks good!
[23:09:35] Reedy a bit
[23:09:41] *reads up*
[23:09:41] (PS3) Ryan Kemper: icinga: grant 'Ryan Kemper' acesss to web UI [puppet] - https://gerrit.wikimedia.org/r/594771 (https://phabricator.wikimedia.org/T251572)
[23:18:13] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable password-reset-update on Wikipedias (T245791) (duration: 01m 07s)
[23:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:18:18] T245791: Enable PRU for all other projects [small] - https://phabricator.wikimedia.org/T245791
[23:18:19] (PS2) Catrope: GrowthExperiments: Disable guidance feature on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/594529
[23:18:24] (CR) Catrope: [C: +2] GrowthExperiments: Disable guidance feature on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/594529 (owner: Catrope)
[23:18:42] hmonroy: Great! It's now live everywhere
[23:18:54] Thank you Roan!!
[23:19:09] (Merged) jenkins-bot: GrowthExperiments: Disable guidance feature on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/594529 (owner: Catrope)
[23:31:48] RECOVERY - mediawiki originals uploads -hourly- for eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[23:32:00] RECOVERY - mediawiki originals uploads -hourly- for codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[23:50:55] Operations, SRE-Access-Requests: Requesting Access to sites from Google Search Console - https://phabricator.wikimedia.org/T251128 (colewhite) Hi @ahemmer! In order to complete this task, we just need sign-off from your manager.
[23:59:17] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Disable GrowthExperiments guidance on testwiki (duration: 01m 07s)
[23:59:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log