[01:56:49] PROBLEM - puppet last run on elastic1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:23:17] RECOVERY - puppet last run on elastic1019 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[03:54:37] PROBLEM - puppet last run on labweb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:18:03] PROBLEM - puppet last run on mw1257 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:21:07] RECOVERY - puppet last run on labweb1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[04:44:33] RECOVERY - puppet last run on mw1257 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[05:12:32] (PS1) Marostegui: db-eqiad.php: Depool all x1 slaves [mediawiki-config] - https://gerrit.wikimedia.org/r/505692 (https://phabricator.wikimedia.org/T136427)
[05:13:50] (CR) Marostegui: [C: +2] db-eqiad.php: Depool all x1 slaves [mediawiki-config] - https://gerrit.wikimedia.org/r/505692 (https://phabricator.wikimedia.org/T136427) (owner: Marostegui)
[05:14:54] (Merged) jenkins-bot: db-eqiad.php: Depool all x1 slaves [mediawiki-config] - https://gerrit.wikimedia.org/r/505692 (https://phabricator.wikimedia.org/T136427) (owner: Marostegui)
[05:16:32] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool all slaves in x1 T136427 (duration: 00m 54s)
[05:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:16:38] T136427: Remove event_page_namespace and event_page_title - https://phabricator.wikimedia.org/T136427
[05:16:55] !log Deploy schema change on x1 master - lag will appear on x1 slaves - T136427
[05:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:41] (CR) jenkins-bot: db-eqiad.php: Depool all x1 slaves [mediawiki-config] - https://gerrit.wikimedia.org/r/505692 (https://phabricator.wikimedia.org/T136427) (owner: Marostegui)
[05:32:24] (PS1) Marostegui: Revert "db-eqiad.php: Depool all x1 slaves" [mediawiki-config] - https://gerrit.wikimedia.org/r/505693
[05:52:20] !log powercycle wtp2019 - no ssh, mgmt console stuck
[05:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:54:05] RECOVERY - Host wtp2019 is UP: PING OK - Packet loss = 0%, RTA = 36.16 ms
[05:56:42] Operations, ops-eqiad: wtp2019 shows error messages in the racadm getsel's output - https://phabricator.wikimedia.org/T221572 (elukey)
[05:56:52] should I tag Ops at https://phabricator.wikimedia.org/T221570 ? I am not sure the existing tags will make people effectively aware of that UBN . TY!
[05:57:09] Operations, ops-codfw: wtp2019 shows error messages in the racadm getsel's output - https://phabricator.wikimedia.org/T221572 (elukey)
[05:57:49] Elitre: I am not sure Operations can really do anything about that, it looks more code related?
[05:58:26] Elitre: o/ thanks for the ping, if SRE don't have any actionable I think it is sufficient to have a note in here
[05:58:56] marostegui: thanks for the input, I really dunno. hey elukey :)
[05:59:12] I guess I wanted my minute of fame in an incident report again :p
[06:06:33] Elitre: Not sure either which tags, maybe SpecialPages?
[06:07:08] Mediawiki-Special-Pages?
[06:12:52] marostegui: I guess I was looking for the name of the team or person who could be looking into this.
[06:17:28] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool all x1 slaves" [mediawiki-config] - https://gerrit.wikimedia.org/r/505693 (owner: Marostegui)
[06:18:36] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool all x1 slaves" [mediawiki-config] - https://gerrit.wikimedia.org/r/505693 (owner: Marostegui)
[06:20:06] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool all slaves in x1 T136427 (duration: 00m 57s)
[06:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:14] T136427: Remove event_page_namespace and event_page_title - https://phabricator.wikimedia.org/T136427
[06:22:06] (PS1) Elukey: profile::analytics::cluster::packages::statistics: install git-lfs on buster [puppet] - https://gerrit.wikimedia.org/r/505694 (https://phabricator.wikimedia.org/T148843)
[06:22:58] (CR) Elukey: [C: +2] profile::analytics::cluster::packages::statistics: install git-lfs on buster [puppet] - https://gerrit.wikimedia.org/r/505694 (https://phabricator.wikimedia.org/T148843) (owner: Elukey)
[06:24:39] (CR) jenkins-bot: Revert "db-eqiad.php: Depool all x1 slaves" [mediawiki-config] - https://gerrit.wikimedia.org/r/505693 (owner: Marostegui)
[06:33:05] (PS1) Marostegui: mariadb: Allow install from db2103 to db2120 [puppet] - https://gerrit.wikimedia.org/r/505695 (https://phabricator.wikimedia.org/T221532)
[06:38:54] Operations, Analytics, Research-management, Patch-For-Review, User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (elukey) >>! In T148843#5125683, @dr0ptp4kt wrote: > (Detour) > > @Nur...
[07:10:27] (PS1) Marostegui: mariadb: Promote db2079 to s8 codfw master [puppet] - https://gerrit.wikimedia.org/r/505697 (https://phabricator.wikimedia.org/T220170)
[07:11:55] (CR) Marostegui: [C: -1] "Requires topology changes first" [puppet] - https://gerrit.wikimedia.org/r/505697 (https://phabricator.wikimedia.org/T220170) (owner: Marostegui)
[07:12:14] (PS1) Marostegui: db-codfw.php: Promote db2079 to s8 codfw master. [mediawiki-config] - https://gerrit.wikimedia.org/r/505699 (https://phabricator.wikimedia.org/T220170)
[07:12:45] (CR) Marostegui: [C: -1] "Requires topologies changes first" [mediawiki-config] - https://gerrit.wikimedia.org/r/505699 (https://phabricator.wikimedia.org/T220170) (owner: Marostegui)
[07:20:22] (PS15) Vgutierrez: tlsproxy::localssl: No hardcoding of prod webproxy hostname [puppet] - https://gerrit.wikimedia.org/r/500406 (owner: Alex Monk)
[07:20:25] (PS1) Gilles: Set wgPriorityHintsRatio [mediawiki-config] - https://gerrit.wikimedia.org/r/505701 (https://phabricator.wikimedia.org/T216499)
[07:23:42] (CR) Gilles: [C: +2] Set wgPriorityHintsRatio [mediawiki-config] - https://gerrit.wikimedia.org/r/505701 (https://phabricator.wikimedia.org/T216499) (owner: Gilles)
[07:27:36] !log gilles@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T216499 Set wgPriorityHintsRatio (duration: 00m 52s)
[07:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:42] T216499: Priority Hints origin trial - https://phabricator.wikimedia.org/T216499
[07:28:07] (PS3) Elukey: admin: add the analytics system user to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/504839 (https://phabricator.wikimedia.org/T220971)
[07:30:09] Operations, Performance-Team, Thumbor, serviceops: Build Thumbor packages for buster - https://phabricator.wikimedia.org/T221562 (Gilles) a:Gilles
[07:30:16] Operations, Performance-Team, Thumbor, serviceops: Build Thumbor packages for buster - https://phabricator.wikimedia.org/T221562 (Gilles) I'll give it a try...
[07:31:16] (CR) jenkins-bot: Set wgPriorityHintsRatio [mediawiki-config] - https://gerrit.wikimedia.org/r/505701 (https://phabricator.wikimedia.org/T216499) (owner: Gilles)
[07:39:48] Operations, DC-Ops, Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (jijiki) @faidon @Dzahn user wpao is in LDAP 'ops' group, but not in the admin.yaml 'ops' group which makes cross-validate-accounts script complain. What should we do ?
[07:49:27] (CR) Filippo Giunchedi: [C: +1] icinga: remove google safe browsing monitoring [puppet] - https://gerrit.wikimedia.org/r/505304 (https://phabricator.wikimedia.org/T216985) (owner: Dzahn)
[07:49:53] (CR) Filippo Giunchedi: [C: +1] icinga: remove google safe browsing monitoring (1 comment) [puppet] - https://gerrit.wikimedia.org/r/505304 (https://phabricator.wikimedia.org/T216985) (owner: Dzahn)
[07:50:58] (CR) Vgutierrez: [C: +2] tlsproxy::localssl: No hardcoding of prod webproxy hostname [puppet] - https://gerrit.wikimedia.org/r/500406 (owner: Alex Monk)
[07:52:15] (CR) Elukey: [C: -1] "scap::target { 'analytics/refinery':" [puppet] - https://gerrit.wikimedia.org/r/504839 (https://phabricator.wikimedia.org/T220971) (owner: Elukey)
[07:56:04] (PS3) Ema: cache: remove unused purge-related hiera settings [puppet] - https://gerrit.wikimedia.org/r/504895 (https://phabricator.wikimedia.org/T219967)
[08:03:46] (CR) Ema: [C: +2] cache: remove unused purge-related hiera settings [puppet] - https://gerrit.wikimedia.org/r/504895 (https://phabricator.wikimedia.org/T219967) (owner: Ema)
[08:04:50] (CR) Filippo Giunchedi: [C: +2] grafana: update swift dashboard with legacy metric name fallback [puppet] - https://gerrit.wikimedia.org/r/504993 (https://phabricator.wikimedia.org/T219825) (owner: Cwhite)
[08:04:59] (PS2) Filippo Giunchedi: grafana: update swift dashboard with legacy metric name fallback [puppet] - https://gerrit.wikimedia.org/r/504993 (https://phabricator.wikimedia.org/T219825) (owner: Cwhite)
[08:06:07] !log installing wget security updates on jessie
[08:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:35] (CR) Vgutierrez: [C: +2] acme-chief: Add script for Designate integration [puppet] - https://gerrit.wikimedia.org/r/497670 (https://phabricator.wikimedia.org/T206922) (owner: Alex Monk)
[08:09:37] (CR) Filippo Giunchedi: [C: +2] "> Patch Set 3:" [software/swift-ring] - https://gerrit.wikimedia.org/r/503714 (owner: Alex Monk)
[08:09:42] (CR) Filippo Giunchedi: [V: +2 C: +2] deployment-prep: Add stretch storage hosts [software/swift-ring] - https://gerrit.wikimedia.org/r/503714 (owner: Alex Monk)
[08:09:44] (PS13) Vgutierrez: acme-chief: Add script for Designate integration [puppet] - https://gerrit.wikimedia.org/r/497670 (https://phabricator.wikimedia.org/T206922) (owner: Alex Monk)
[08:11:47] (PS4) Elukey: admin: add the analytics-hdfs system user to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/504839 (https://phabricator.wikimedia.org/T220971)
[08:14:21] !log removing debmonitor entries for labvirt* hosts
[08:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:30] (CR) Vgutierrez: [C: +2] "PCC is happy: https://puppet-compiler.wmflabs.org/compiler1002/15933/" [puppet] - https://gerrit.wikimedia.org/r/501461 (owner: Alex Monk)
[08:20:42] (PS3) Vgutierrez: wikiba.se TLS: Make support for different certificate sources clearer [puppet] - https://gerrit.wikimedia.org/r/501461 (owner: Alex Monk)
[08:22:54] (PS3) Vgutierrez: archiva::proxy: remove old letsencrypt module stuff [puppet] - https://gerrit.wikimedia.org/r/504648 (https://phabricator.wikimedia.org/T221268) (owner: Alex Monk)
[08:25:09] (CR) Vgutierrez: [C: +2] "PCC is happy: https://puppet-compiler.wmflabs.org/compiler1002/15934/" [puppet] - https://gerrit.wikimedia.org/r/504648 (https://phabricator.wikimedia.org/T221268) (owner: Alex Monk)
[08:25:41] Operations, monitoring, Patch-For-Review: prometheus1004 /srv/prometheus/ops almost full - https://phabricator.wikimedia.org/T220326 (fgiunchedi) >>! In T220326#5120519, @Dzahn wrote: > The specific check this was about, disk on prometheus 1004, now has the Icinga link: > > https://icinga.wikimedia....
[08:28:39] Operations, Puppet, Traffic, Patch-For-Review: Remove old letsencrypt puppet module - https://phabricator.wikimedia.org/T221268 (Krenair)
[08:33:22] Operations, Traffic: Puppetize ATS TLS configuration for incoming traffic - https://phabricator.wikimedia.org/T221594 (Vgutierrez)
[08:33:34] Operations, Traffic: Puppetize ATS TLS configuration for incoming traffic - https://phabricator.wikimedia.org/T221594 (Vgutierrez) p:Triage→Normal
[08:38:30] (Abandoned) Alex Monk: certcentral: Reload instead of restart when config changes [puppet] - https://gerrit.wikimedia.org/r/470866 (owner: Alex Monk)
[08:41:18] (CR) Alex Monk: [C: -1] "(The note in the commit message here about certcentral/acme-chief no longer applies.)" [puppet] - https://gerrit.wikimedia.org/r/437640 (https://phabricator.wikimedia.org/T194962) (owner: Alex Monk)
[08:42:18] (PS2) Vgutierrez: CI: Run tests with minimum and latest dependencies [software/acme-chief] - https://gerrit.wikimedia.org/r/497715 (https://phabricator.wikimedia.org/T213820)
[08:42:41] Krenair: BTW, maybe we should reconsider https://gerrit.wikimedia.org/r/c/operations/software/acme-chief/+/497715
[08:42:53] the blocker disappeared after upgrading to buster dependencies
[08:44:48] (PS14) Matthias Mullie: SDC: Enable Depicts functionality on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/498145 (https://phabricator.wikimedia.org/T218913) (owner: Jforrester)
[08:44:53] (PS1) Ema: conftool-data: add ats-be to cache_upload@ulsfo [puppet] - https://gerrit.wikimedia.org/r/505708 (https://phabricator.wikimedia.org/T219967)
[08:46:23] !log fdans@deploy1001 Started deploy [analytics/refinery@0d63671]: deploying changes to pageview definition brought in refinery source 0.0.87
[08:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:22] (PS2) Gehel: Fix DCATAP dump loading [puppet] - https://gerrit.wikimedia.org/r/504992 (https://phabricator.wikimedia.org/T221405) (owner: Smalyshev)
[08:49:12] (CR) Alex Monk: [C: +2] CI: Run tests with minimum and latest dependencies [software/acme-chief] - https://gerrit.wikimedia.org/r/497715 (https://phabricator.wikimedia.org/T213820) (owner: Vgutierrez)
[08:49:24] (CR) Gehel: [C: +2] Fix DCATAP dump loading [puppet] - https://gerrit.wikimedia.org/r/504992 (https://phabricator.wikimedia.org/T221405) (owner: Smalyshev)
[08:51:17] (PS1) Muehlenhoff: Remove account expiry date for jdcc [puppet] - https://gerrit.wikimedia.org/r/505711
[08:51:56] (Merged) jenkins-bot: CI: Run tests with minimum and latest dependencies [software/acme-chief] - https://gerrit.wikimedia.org/r/497715 (https://phabricator.wikimedia.org/T213820) (owner: Vgutierrez)
[08:52:14] (PS1) Filippo Giunchedi: hieradata: prometheus v2 on bast5001 [puppet] - https://gerrit.wikimedia.org/r/505712 (https://phabricator.wikimedia.org/T187987)
[08:54:29] !log synchronizing old docker_registry content into new one - T221101
[08:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:34] T221101: migrate endpoint from old registry instance to new one - https://phabricator.wikimedia.org/T221101
[08:54:39] (CR) jenkins-bot: CI: Run tests with minimum and latest dependencies [software/acme-chief] - https://gerrit.wikimedia.org/r/497715 (https://phabricator.wikimedia.org/T213820) (owner: Vgutierrez)
[08:54:51] Krenair: thx :D
[08:55:05] np
[08:55:19] I've got a couple of really old (certcentral) changes sitting around in Gerrit still
[08:55:33] I should probably reupload them against acme-chief and fix them up at some point
[08:58:27] PROBLEM - dhclient process on etcd1003 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.42: Connection reset by peer
[08:58:39] PROBLEM - Check size of conntrack table on etcd1003 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.42: Connection reset by peer
[08:58:43] PROBLEM - Check whether ferm is active by checking the default input chain on etcd1003 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.42: Connection reset by peer
[08:58:45] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:58:49] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro
[08:59:05] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb={LIST,PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:59:09] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb={GET,PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:59:11] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb={PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:59:25] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:59:35] RECOVERY - dhclient process on etcd1003 is OK: PROCS OK: 0 processes with command name dhclient
[08:59:47] RECOVERY - Check size of conntrack table on etcd1003 is OK: OK: nf_conntrack is 0 % full
[08:59:53] RECOVERY - Check whether ferm is active by checking the default input chain on etcd1003 is OK: OK ferm input default policy is set
[08:59:55] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:00:01] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[09:00:32] !log fdans@deploy1001 Finished deploy [analytics/refinery@0d63671]: deploying changes to pageview definition brought in refinery source 0.0.87 (duration: 14m 09s)
[09:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:43] Operations, Core Platform Team, DBA, MediaWiki-Database, and 3 others: Special:Log on commons -- entire web request took longer than 60 seconds and timed out - https://phabricator.wikimedia.org/T221458 (Marostegui) The query reported at https://phabricator.wikimedia.org/T221380#5127416 is differe...
[09:01:21] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:01:43] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:02:44] etcd alerts around there matches perfectly with a Apr 23 08:55:02 etcd1003 puppet-agent-cronjob: Sleeping 6 for random splay
[09:02:45] Apr 23 08:55:18 etcd1003 puppet-agent-cronjob: INFO:debmonitor:Found 448 installed binary packages
[09:02:45] Apr 23 08:55:18 etcd1003 puppet-agent-cronjob: INFO:debmonitor:Found 18 upgradable binary packages (including new dependencies)
[09:02:45] Apr 23 08:55:19 etcd1003 puppet-agent-cronjob: INFO:debmonitor:Successfully sent the upgradable update to the DebMonitor server
[09:03:21] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:03:33] https://grafana.wikimedia.org/d/000000274/prometheus-machine-stats?orgId=1&var-server=etcd1003&var-datasource=eqiad%20prometheus%2Fops&from=now-3h&to=now-1m
[09:03:53] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:04:12] that cronjob affected etcd which is quite sensitive to load
[09:04:25] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:04:25] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:04:33] https://www.irccloud.com/pastebin/DkPtqbll/
[09:04:55] and created related alerts to kubernetes-api due to etcd slowness
[09:05:47] moritzm: since v.olans is not around, is that cronjob that load intensive usually?
[09:06:19] (PS2) Ema: conftool-data: add ats-be to cache_upload@ulsfo [puppet] - https://gerrit.wikimedia.org/r/505708 (https://phabricator.wikimedia.org/T219967)
[09:07:02] (CR) Ema: [C: +2] conftool-data: add ats-be to cache_upload@ulsfo [puppet] - https://gerrit.wikimedia.org/r/505708 (https://phabricator.wikimedia.org/T219967) (owner: Ema)
[09:10:07] fsero: it just syncs the package state with the debmonitor db, e.g. for packages which were installed with dpkg
[09:10:31] FYI I'll be starting bast5001 migration to prometheus v2 shortly, that will cause historical metrics for eqsin to not be available until migration + backfill is complete
[09:11:21] that's most definitely unrelated, that cron can't reasonably ramp up load to the effect observed
[09:13:28] moritzm: is the only thing that i see that happened at 08:55, not saying it's the culprit but is the only thing i saw
[09:13:59] I'll have a look at systemd logs later, but it can't be triggered by the debmonitor run, it's running daily on all hosts, we'd see this pattern for all etcd hosts more often
[09:14:33] for reference other runs of that cron didnt affect at all
[09:14:36] so you are right moritzm
[09:15:07] it shouldnt affect that much
[09:15:25] that's the cron that runs puppet agent afaics, it could be actions from puppet itself too
[09:15:30] Operations, Core Platform Team, DBA, MediaWiki-Database, and 3 others: Special:Log on commons -- entire web request took longer than 60 seconds and timed out - https://phabricator.wikimedia.org/T221458 (jcrespo) Let's keep them separated **for now**, my bet is they are the same underlying issue,...
[09:15:39] Operations, Performance-Team, Thumbor, serviceops: Build Thumbor packages for buster - https://phabricator.wikimedia.org/T221562 (Gilles) Thumbor was very straightforward, it only required a tiny patch to relax a python version check in setup.py (bringing it in line with thumbor master). My build...
[09:16:35] Operations, Performance-Team, Thumbor, serviceops: Build Thumbor packages for buster - https://phabricator.wikimedia.org/T221562 (Gilles)
[09:16:53] (CR) Filippo Giunchedi: [C: +2] hieradata: prometheus v2 on bast5001 [puppet] - https://gerrit.wikimedia.org/r/505712 (https://phabricator.wikimedia.org/T187987) (owner: Filippo Giunchedi)
[09:17:01] (PS2) Filippo Giunchedi: hieradata: prometheus v2 on bast5001 [puppet] - https://gerrit.wikimedia.org/r/505712 (https://phabricator.wikimedia.org/T187987)
[09:19:53] moritzm: is shdubsh still on duty ?
[09:19:55] !log dumping Kafka consumer offsets' history on logstash1012 for T221202
[09:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:00] T221202: kafka-logging __consumer_offsets topic traffic increased - https://phabricator.wikimedia.org/T221202
[09:20:15] jijiki: per the SRE pad it is herron
[09:20:19] this week
[09:20:29] just to update the topic
[09:20:32] I cant :)
[09:23:15] !log upgrade prometheus to v2 on bast5001, previous metrics will not be available until migration and backfill are complete - T187987
[09:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:20] T187987: 100% of Prometheus traffic served by Prometheus v2 - https://phabricator.wikimedia.org/T187987
[09:28:00] Operations, Performance-Team, Thumbor, serviceops: Build Thumbor packages for buster - https://phabricator.wikimedia.org/T221562 (Gilles) I couldn't find a repo for the python-thumbor-community-core package Debian sources. I assume that the build we have is based on what I put together here origi...
[09:28:10] Operations, Performance-Team, Thumbor, serviceops: Build Thumbor packages for buster - https://phabricator.wikimedia.org/T221562 (Gilles)
[09:28:24] Operations, Performance-Team, Thumbor, serviceops: Build Thumbor packages for buster - https://phabricator.wikimedia.org/T221562 (Gilles)
[09:29:07] moritzm: found, is a ganetiVM and ganeti host was having a hard time due to drbd
[09:29:52] (PS26) Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967)
[09:29:57] for reference https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=ganeti&var-instance=All&from=1556000985396&to=1556011785396
[09:31:12] Operations, Performance-Team, Thumbor, serviceops: Build Thumbor packages for buster - https://phabricator.wikimedia.org/T221562 (Gilles)
[09:33:37] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 58.68 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[09:34:45] PROBLEM - puppet last run on bast5001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:34:55] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 79.5 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[09:35:28] (PS1) Ema: cache: stop passing route_table to varnish-fe [puppet] - https://gerrit.wikimedia.org/r/505724 (https://phabricator.wikimedia.org/T219967)
[09:35:55] godog: is the eqsin alert above related to bast5001 maintenance?
[09:36:49] ema: yeah
[09:36:50] I assume so, just double-checking :)
[09:36:52] Operations, Performance-Team, Thumbor, serviceops: Build Thumbor packages for buster - https://phabricator.wikimedia.org/T221562 (Gilles) Same story got python-manhole, took it from https://github.com/gi11es/thumbor-debian/tree/master/python-manhole It build as-is. To be found with the other ones...
[09:37:02] Operations, Performance-Team, Thumbor, serviceops: Build Thumbor packages for buster - https://phabricator.wikimedia.org/T221562 (Gilles)
[09:37:12] thanks for checking!
[09:37:23] (PS1) Alex Monk: deployment-prep: Change jessie storage hosts to weight 0 [software/swift-ring] - https://gerrit.wikimedia.org/r/505725
[09:42:12] (CR) Ema: [C: +2] cache: stop passing route_table to varnish-fe [puppet] - https://gerrit.wikimedia.org/r/505724 (https://phabricator.wikimedia.org/T219967) (owner: Ema)
[09:46:13] (PS11) Jbond: raid: add ssacli class [puppet] - https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787)
[09:48:16] (PS2) Jcrespo: mariadb-backups: Set dbprov100[12] as spare for reimage [puppet] - https://gerrit.wikimedia.org/r/504562 (https://phabricator.wikimedia.org/T219399)
[09:48:56] (CR) Jbond: "> Patch Set 10:" [puppet] - https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) (owner: Jbond)
[09:49:01] (PS27) Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967)
[09:49:05] (PS1) Gilles: Buster compatibility [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/505727 (https://phabricator.wikimedia.org/T221562)
[09:50:49] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/505667 (owner: Aklapper)
[09:53:51] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/504839 (https://phabricator.wikimedia.org/T220971) (owner: Elukey)
[09:54:17] Operations, Cloud-VPS, Toolforge, LDAP, and 2 others: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280 (aborrero)
[09:54:22] (PS28) Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967)
[09:54:53] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/505711 (owner: Muehlenhoff)
[09:57:06] (PS2) Muehlenhoff: Remove account expiry date for jdcc [puppet] - https://gerrit.wikimedia.org/r/505711
[10:00:04] (PS29) Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967)
[10:04:15] (CR) Jcrespo: [C: +2] mariadb-backups: Set dbprov100[12] as spare for reimage [puppet] - https://gerrit.wikimedia.org/r/504562 (https://phabricator.wikimedia.org/T219399) (owner: Jcrespo)
[10:04:59] (PS2) Marostegui: mariadb: Allow install from db2103 to db2120 [puppet] - https://gerrit.wikimedia.org/r/505695 (https://phabricator.wikimedia.org/T221532)
[10:05:07] (PS2) Gilles: Buster compatibility [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/505727 (https://phabricator.wikimedia.org/T221562)
[10:07:05] (CR) Alex Monk: "This is live now and disk usage on the old hosts is dropping" [software/swift-ring] - https://gerrit.wikimedia.org/r/505725 (owner: Alex Monk)
[10:14:15] Operations, DC-Ops, Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (MoritzMuehlenhoff) >>! In T221142#5129395, @jijiki wrote: > @faidon @Dzahn user wpao is in LDAP 'ops' group, but not in the admin.yaml 'ops' group which makes cross-validate-accounts script complai...
[10:15:15] (PS1) Jbond: kafka shipper: move kafka rsyslog shipping to base profile [puppet] - https://gerrit.wikimedia.org/r/505737 (https://phabricator.wikimedia.org/T220987)
[10:15:59] (CR) Marostegui: [C: +2] mariadb: Allow install from db2103 to db2120 [puppet] - https://gerrit.wikimedia.org/r/505695 (https://phabricator.wikimedia.org/T221532) (owner: Marostegui)
[10:16:05] Operations, Cloud-VPS, Toolforge, LDAP, and 2 others: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280 (aborrero)
[10:16:06] (CR) jerkins-bot: [V: -1] kafka shipper: move kafka rsyslog shipping to base profile [puppet] - https://gerrit.wikimedia.org/r/505737 (https://phabricator.wikimedia.org/T220987) (owner: Jbond)
[10:21:11] (PS3) Gilles: Buster compatibility [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/505727 (https://phabricator.wikimedia.org/T221562)
[10:22:00] (CR) Fsero: "LGTM overall add some comments" (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/505263 (https://phabricator.wikimedia.org/T220401) (owner: Alexandros Kosiaris)
[10:23:23] (PS2) Jbond: kafka shipper: move kafka rsyslog shipping to base profile [puppet] - https://gerrit.wikimedia.org/r/505737 (https://phabricator.wikimedia.org/T220987)
[10:27:29] (PS4) Gilles: Buster compatibility [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/505727 (https://phabricator.wikimedia.org/T221562)
[10:31:06] (PS1) Filippo Giunchedi: logstash: ramp up logs retention [puppet] - https://gerrit.wikimedia.org/r/505741 (https://phabricator.wikimedia.org/T220103)
[10:33:32] (CR) Marostegui: [C: +1] "The last patch looks good" [puppet] - https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) (owner: Jbond)
[10:37:13] (PS30) Ema: role::cache::upload_ats: Varnish frontend / ATS backend setup [puppet] - https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967)
[10:39:22] Operations, SRE-Access-Requests, Patch-For-Review: analytics-wmde group addition for Lucas Werkmeister - https://phabricator.wikimedia.org/T220084 (Lucas_Werkmeister_WMDE) Seems to work so far, thank you :)
[10:39:31] jouncebot: next
[10:39:31] In 0 hour(s) and 20 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190423T1100)
[10:44:20] !log uploaded ferm 2.4-1+wmf2+deb10u1 to buster-wikimedia (T153468)
[10:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:26] T153468: Ferm's upstream Net::DNS Perl library questionable handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs - https://phabricator.wikimedia.org/T153468
[10:50:30] (PS5) Gilles: Buster compatibility [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/505727 (https://phabricator.wikimedia.org/T221562)
[10:50:46] (PS1) Ema: profile::trafficserver::backend: do not configure vhtcpd [puppet] - https://gerrit.wikimedia.org/r/505748 (https://phabricator.wikimedia.org/T219967)
[10:52:28] (CR) Gilles: "The mysterious hanging on that trivial test is worrying, I'll keep investigating. But for now this is enough to build and test a Buster De" [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/505727 (https://phabricator.wikimedia.org/T221562) (owner: Gilles)
[10:54:18] (PS31) Ema: role::cache::upload_ats: Varnish frontend / ATS backend setup [puppet] - https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967)
[10:54:53] (CR) Effie Mouzeli: [C: +2] webperf: Remove arclamp subscriber from mwlog servers [puppet] - https://gerrit.wikimedia.org/r/503675 (https://phabricator.wikimedia.org/T195312) (owner: Krinkle)
[10:55:16] (PS3) Effie Mouzeli: webperf: Remove arclamp subscriber from mwlog servers [puppet] - https://gerrit.wikimedia.org/r/503675 (https://phabricator.wikimedia.org/T195312) (owner: Krinkle)
[11:00:04] hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate European Mid-day SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190423T1100).
[11:00:04] kart_, rxy, bmansurov, and Krenair: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:07] o/
[11:00:09] here
[11:00:11] here
[11:00:47] hey
[11:01:25] Krenair: SWAT'ng?
[11:02:08] no, I have not held deployment access for over two years
[11:03:14] OK. Let me start with my patch and see. I'm yet learning :)
[11:03:27] ok :)
[11:04:20] is any SWATter around in case things go wrong?
[11:04:32] (CR) KartikMistry: [C: +2] "SWAT." [mediawiki-config] - https://gerrit.wikimedia.org/r/505220 (https://phabricator.wikimedia.org/T221353) (owner: Petar.petkovic)
[11:04:55] Lucas_WMDE: not sure. zeljkof, around?
[11:07:41] (03PS2) 10KartikMistry: Use higher unmodified MT threshold for Indonesian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505220 (https://phabricator.wikimedia.org/T221353) (owner: 10Petar.petkovic) [11:08:00] sigh [11:10:12] dereckson: hi? [11:11:24] Testing my patch on mwdebug.. [11:11:27] in your case you'll want to sync InitialiseSettings before CommonSettings [11:11:33] the rest all looks straightforward [11:12:39] Hi [11:12:45] rxy: how can I help you? [11:12:59] can you handle SWAT in this time? [11:13:07] I don't have my ssh key with me [11:13:32] k, thanks [11:13:52] Krenair: my patch? [11:14:10] yes [11:14:30] (03CR) 10jenkins-bot: Use higher unmodified MT threshold for Indonesian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505220 (https://phabricator.wikimedia.org/T221353) (owner: 10Petar.petkovic) [11:14:47] kart_ does SWAT in this time (includes another patches)? [11:15:21] Krenair: I can sync folder, right or not right method? [11:16:51] https://wikitech.wikimedia.org/wiki/How_to_deploy_code doesn't actually mention sync-dir anymore. I have a feeling they changed the process at some point since my time [11:17:08] sync-file also supports directories as far as I’m aware [11:17:46] (03PS1) 10Jbond: kafka shipper: add ulogd to kafka forwarding rules [puppet] - 10https://gerrit.wikimedia.org/r/505750 (https://phabricator.wikimedia.org/T220987) [11:17:48] yep. sync-file only wmf-config works. 
[11:18:48] !log kartik@deploy1001 Synchronized wmf-config: SWAT: [[gerrit:505220]] Use higher unmodified MT threshold for Indonesian Wikipedia (T221353) (duration: 00m 57s) [11:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:55] T221353: Make more strict the check for unmodified content for the whole document on Indonesian Wikipedia - https://phabricator.wikimedia.org/T221353 [11:20:03] and it seems to be throwing errors :/ [11:21:38] Krenair: Was that the reason about syncing CommonSettings first? [11:22:16] https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 [11:22:40] I don't have logstash access [11:22:57] your CommonSettings code relies on wmgContentTranslationUnmodifiedMTThresholdForPublish being set, but you're only creating that variable in this commit [11:23:00] ah OK. [11:23:21] meaning every apache that gets CommonSettings before InitialiseSettings will throw errors until it gets InitialiseSettings [11:24:05] Krenair: so, do I just need to wait, or do I need to revert? [11:24:09] you probably saw something about wmgContentTranslationUnmodifiedMTThresholdForPublish being undefined in the logs? [11:24:26] yes [11:24:29] Notice: Undefined variable: wmgContentTranslationUnmodifiedMTThresholdForPublish in /srv/mediawiki/wmf-config/CommonSettings.php on line 3167 [11:24:57] It probably assumes a value of 0. [11:25:10] For example: [11:25:11] php > var_dump($a + 1); [11:25:11] PHP Notice: Undefined variable: a in php shell code on line 1 [11:25:11] int(1) [11:26:04] If it's still throwing errors I'd suggest syncing InitialiseSettings on its own, explicitly. [11:26:21] If it stopped then there's nothing to be done. [11:26:34] Let me touch and sync. [11:26:35] PROBLEM - puppet last run on analytics1060 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [11:27:23] !log installing clamav security updates on mendelevium (OTRS host) [11:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:29] That error likely shows in the logs but hopefully not for users. [11:27:55] And hopefully the value being assumed to be 0 does not have any harmful effects, but you'd be the expert on that [11:29:02] !log kartik@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: Fix undefined variable from last SWAT (duration: 00m 54s) [11:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:30] are the undefined variable notices dying down now? [11:29:35] Krenair: going down now. [11:29:40] cool. [11:29:59] I'll wait for a while. Sorry about it, folks. [11:30:27] I think I'll add note about such deployment in the page. ToDo for tonight. [11:31:08] I have a feeling they removed sync-dir for a reason :) [11:32:01] Indeed. [11:32:38] There is a note about performing operations in the correct order on the how To Deploy Code page on wikitech [11:32:56] scap sync-dir is the same as scap sync-file now [11:32:57] usage: scap sync-dir [-h] [--conf CONF_FILE] [--no-shared-authsock] [11:32:57] [-D :] [-v] [--environment ENVIRONMENT] [11:32:57] [--no-log-message] [--force] [11:32:57] file [message [message ...]] [11:32:59] Sync a specific file/directory to the cluster. [11:33:19] Noted! [11:33:41] hm, apparently it just got deprecated: https://wikitech.wikimedia.org/w/index.php?title=How_to_deploy_code&diff=1426908&oldid=1008107 [11:34:41] bmansurov: around? [11:34:51] bmansurov: we can go with your patch. [11:35:07] yes [11:35:11] let's do it [11:35:13] OK. Merging. [11:35:27] (03CR) 10KartikMistry: [C: 03+2] "SWAT." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/505643 (https://phabricator.wikimedia.org/T213969) (owner: 10Bmansurov) [11:36:37] I should rebase :/ [11:36:57] (03PS2) 10KartikMistry: Turn off logging for CitationUsage and CitationUsagePageLoad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505643 (https://phabricator.wikimedia.org/T213969) (owner: 10Bmansurov) [11:40:32] bmansurov: on mwdebug1002. [11:40:37] ok testing [11:40:38] bmansurov: test and let me know. [11:41:26] kart_: looks good, please continue [11:42:24] cool. deploying.. [11:42:48] 10Operations, 10Beta-Cluster-Infrastructure, 10DNS, 10Traffic, and 4 others: Ferm's upstream Net::DNS Perl library questionable handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs - https://phabricator.wikimedia.org/T153468 (10MoritzMuehlenhoff) All bust... [11:43:11] !log kartik@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:505643]] Turn off logging for CitationUsage and CitationUsagePageLoad (T213969) (duration: 00m 53s) [11:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:17] T213969: Citation Usage: run third round of data collection - https://phabricator.wikimedia.org/T213969 [11:43:22] Done. [11:43:35] Krenair: your turn now.. [11:44:21] ok [11:44:26] Krenair: Is it possible to sync it together or need separate sync? [11:44:29] mine should be no-ops in prod [11:44:33] you can sync together [11:44:37] OK. [11:44:53] (03PS2) 10KartikMistry: deployment-prep: Use new poolcounter instance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505059 (owner: 10Alex Monk) [11:45:17] it's just LabsServices.php which is not used from prod servers [11:45:19] !log Stop xenon-log, excimer-log and apache on mwlog* [11:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:33] kart_: thank you! 
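Editorial sketch of the ordering problem discussed between 11:22 and 11:29: CommonSettings.php read a variable that InitialiseSettings.php only started defining in the same commit, so any server that received CommonSettings first logged "Undefined variable" notices until InitialiseSettings arrived. The Python below is a toy model of one server under that assumption, not the real deployment tooling; the function names and the threshold value 85 are hypothetical, and only the two file names come from the log.

```python
# Toy model only: the two file names mirror the log; everything else is hypothetical.

def serve_request(host_files):
    """Mimic the new CommonSettings reading a variable that the new
    InitialiseSettings defines. `host_files` is the set of updated files
    already present on this host."""
    settings = {}
    if "InitialiseSettings" in host_files:
        settings["threshold"] = 85  # hypothetical value, not taken from the log
    if "CommonSettings" in host_files:
        # The new CommonSettings depends on the variable being defined.
        if "threshold" not in settings:
            return "notice: undefined variable"
        return "ok"
    # The old CommonSettings never reads the variable, so nothing can go wrong.
    return "ok"


def sync(order):
    """Deliver files to one host in `order` and serve one request after each step."""
    host_files = set()
    results = []
    for name in order:
        host_files.add(name)
        results.append(serve_request(host_files))
    return results


# Defining file first: every intermediate state is valid.
print(sync(["InitialiseSettings", "CommonSettings"]))
# Consuming file first: the host logs a notice until the defining file arrives.
print(sync(["CommonSettings", "InitialiseSettings"]))
```

In this model the first order yields two clean requests, while the second yields a notice followed by a clean request. That matches the remedy used in the channel, where syncing InitialiseSettings on its own made the notices die down.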
[11:45:36] (03CR) 10Alexandros Kosiaris: First version of the kask chart (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/505263 (https://phabricator.wikimedia.org/T220401) (owner: 10Alexandros Kosiaris) [11:45:42] it does, as a matter of process, get sync'd in prod though [11:46:13] (03CR) 10KartikMistry: [C: 03+2] "SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505059 (owner: 10Alex Monk) [11:46:32] (03PS2) 10Alexandros Kosiaris: First version of the kask chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/505263 (https://phabricator.wikimedia.org/T220401) [11:47:11] (03Merged) 10jenkins-bot: deployment-prep: Use new poolcounter instance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505059 (owner: 10Alex Monk) [11:47:58] (03PS2) 10KartikMistry: deployment-prep: Use new ms-fe host. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505060 (owner: 10Alex Monk) [11:49:23] (03CR) 10KartikMistry: [C: 03+2] "SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505060 (owner: 10Alex Monk) [11:50:17] Krenair: do you need to test using mwdebug? [11:50:22] no [11:50:26] (03Merged) 10jenkins-bot: deployment-prep: Use new ms-fe host. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505060 (owner: 10Alex Monk) [11:50:27] OK [11:52:24] sigh. [11:52:42] Krenair: accidently tab. But I'll log it once done. [11:53:02] thanks [11:53:05] !log kartik@deploy1001 Synchronized wmf-config/LabsServices.php: SWAT: [[gerrit:505643]] (duration: 00m 53s) [11:53:05] RECOVERY - puppet last run on analytics1060 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:32] !log 'SWAT: [[gerrit:505059]] deployment-prep: Use new poolcounter instance, [[gerrit:505060]] deployment-prep: Use new ms-fe host.' [11:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:53] OK. 
So we don't have much time for rxy's patch. [11:55:08] (03CR) 10jenkins-bot: Turn off logging for CitationUsage and CitationUsagePageLoad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505643 (https://phabricator.wikimedia.org/T213969) (owner: 10Bmansurov) [11:55:10] (03CR) 10jenkins-bot: deployment-prep: Use new poolcounter instance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505059 (owner: 10Alex Monk) [11:55:12] (03CR) 10jenkins-bot: deployment-prep: Use new ms-fe host. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505060 (owner: 10Alex Monk) [11:55:12] :O [11:55:35] Should I re-schedule? or not? [11:55:44] rxy: Sorry about that. Given my inexperience with wmf branch deployment, you should reschedule. [11:55:47] Please. [11:55:56] ok, thanks. [11:56:25] !log EU-Midday SWAT is done. [11:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:49] 10Operations, 10DC-Ops, 10Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10jijiki) >>! In T221142#5129920, @MoritzMuehlenhoff wrote: >>>! In T221142#5129395, @jijiki wrote: >> @faidon @Dzahn user wpao is in LDAP 'ops' group, but not in the admin.yaml 'ops' group which mak... [11:58:23] rxy: Ping me once you done. I think we should consider wmf branch deployment first as CI will take more time. [11:59:20] !log installing clamav security updates on fermium [11:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:34] haha. sure. core CI is taking many time. [12:01:01] Hope that commit of rxy's makes it out in the next week... [12:01:31] (03PS3) 10Muehlenhoff: Remove account expiry date for jdcc [puppet] - 10https://gerrit.wikimedia.org/r/505711 [12:02:26] Thanks. 
:D [12:03:50] (03CR) 10Muehlenhoff: [C: 03+2] Remove account expiry date for jdcc [puppet] - 10https://gerrit.wikimedia.org/r/505711 (owner: 10Muehlenhoff) [12:11:01] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [12:12:22] 10Operations: Broken puppet in the 'logging' project - https://phabricator.wikimedia.org/T221450 (10fgiunchedi) a:05fgiunchedi→03None [12:12:31] rxy: curious is all the ci checks taking a while to finish or to start? [12:12:35] 10Operations: Broken puppet in the 'logging' project - https://phabricator.wikimedia.org/T221450 (10fgiunchedi) Fixed `filippo-log-jessie01` ! [12:12:50] (03PS12) 10Jbond: raid: add ssacli class [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) [12:13:57] (03CR) 10Jbond: [C: 03+2] raid: add ssacli class [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [12:14:48] Zppix: We need to +2 at SWAT, by middle of SWAT, it will be merge :) [12:15:17] kart_: so it is taking long to finish? [12:15:41] (03CR) 10Fsero: First version of the kask chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/505263 (https://phabricator.wikimedia.org/T220401) (owner: 10Alexandros Kosiaris) [12:15:45] !log swift eqiad-prod: fully decom ms-be1013 - T220590 [12:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:51] T220590: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 [12:18:30] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1013 - https://phabricator.wikimedia.org/T220907 (10fgiunchedi) >>! In T220907#5120604, @Cmjohnson wrote: > @fgiunchedi do you want to power off unplug and power on...that will clear the issue Yes please drain the power! 
Thanks [12:24:00] (03PS1) 10Effie Mouzeli: thumbor: Use port 8800 for haproxy [puppet] - 10https://gerrit.wikimedia.org/r/505759 (https://phabricator.wikimedia.org/T187765) [12:24:29] (03CR) 10jerkins-bot: [V: 04-1] thumbor: Use port 8800 for haproxy [puppet] - 10https://gerrit.wikimedia.org/r/505759 (https://phabricator.wikimedia.org/T187765) (owner: 10Effie Mouzeli) [12:26:13] (03PS2) 10Effie Mouzeli: thumbor: Use port 8800 for haproxy [puppet] - 10https://gerrit.wikimedia.org/r/505759 (https://phabricator.wikimedia.org/T187765) [12:27:33] (03CR) 10Alexandros Kosiaris: First version of the kask chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/505263 (https://phabricator.wikimedia.org/T220401) (owner: 10Alexandros Kosiaris) [12:28:44] (03PS3) 10Alexandros Kosiaris: First version of the kask chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/505263 (https://phabricator.wikimedia.org/T220401) [12:29:11] (03CR) 10Fsero: First version of the kask chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/505263 (https://phabricator.wikimedia.org/T220401) (owner: 10Alexandros Kosiaris) [12:33:10] (03PS3) 10Muehlenhoff: redis::instance: Remove support for Ubuntu/Upstart [puppet] - 10https://gerrit.wikimedia.org/r/500415 [12:38:45] (03PS1) 10Jbond: RAID: replace hpssacli with sscli [puppet] - 10https://gerrit.wikimedia.org/r/505760 (https://phabricator.wikimedia.org/T220787) [12:40:41] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool - https://phabricator.wikimedia.org/T220787 (10jbond) Raid checks now appear to be working with the new ssacli tool. The latest CR (https://gerrit.wikimedia.org/r/50... 
[12:42:55] (03PS1) 10Arturo Borrero Gonzalez: ldap: sssd: add missing bits for sudo [puppet] - 10https://gerrit.wikimedia.org/r/505761 (https://phabricator.wikimedia.org/T221225) [12:45:40] (03PS2) 10Arturo Borrero Gonzalez: ldap: sssd: add missing bits for sudo [puppet] - 10https://gerrit.wikimedia.org/r/505761 (https://phabricator.wikimedia.org/T221225) [12:46:23] (03PS2) 10Ema: profile::trafficserver::backend: do not configure vhtcpd [puppet] - 10https://gerrit.wikimedia.org/r/505748 (https://phabricator.wikimedia.org/T219967) [12:47:26] (03CR) 10Ema: [C: 03+2] profile::trafficserver::backend: do not configure vhtcpd [puppet] - 10https://gerrit.wikimedia.org/r/505748 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [12:47:42] (03PS3) 10Arturo Borrero Gonzalez: ldap: sssd: add missing bits for sudo [puppet] - 10https://gerrit.wikimedia.org/r/505761 (https://phabricator.wikimedia.org/T221225) [12:52:16] (03PS7) 10Vgutierrez: acme_chief: Prevalidate CN/SNI list [software/acme-chief] - 10https://gerrit.wikimedia.org/r/504512 (https://phabricator.wikimedia.org/T220518) [12:52:29] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] ldap: sssd: add missing bits for sudo [puppet] - 10https://gerrit.wikimedia.org/r/505761 (https://phabricator.wikimedia.org/T221225) (owner: 10Arturo Borrero Gonzalez) [12:53:33] (03CR) 10Vgutierrez: "This change is ready for review." 
(031 comment) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/504512 (https://phabricator.wikimedia.org/T220518) (owner: 10Vgutierrez) [12:57:39] (03PS5) 10Filippo Giunchedi: elastalert: new module [puppet] - 10https://gerrit.wikimedia.org/r/502773 (https://phabricator.wikimedia.org/T213933) [12:57:41] (03PS1) 10Filippo Giunchedi: WIP elastalert: enable on logstash1007 [puppet] - 10https://gerrit.wikimedia.org/r/505762 (https://phabricator.wikimedia.org/T213933) [12:58:07] (03PS32) 10Ema: role::cache::upload_ats: Varnish frontend / ATS backend setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967) [12:58:19] (03CR) 10jerkins-bot: [V: 04-1] WIP elastalert: enable on logstash1007 [puppet] - 10https://gerrit.wikimedia.org/r/505762 (https://phabricator.wikimedia.org/T213933) (owner: 10Filippo Giunchedi) [13:02:18] (03PS1) 10Tulsi Bhagat: Add namespace "Aldono" at eo.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505765 [13:05:36] (03PS1) 10Ema: cache: reimage cp4021 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/505767 (https://phabricator.wikimedia.org/T219967) [13:07:47] (03CR) 10Ema: [C: 03+2] role::cache::upload_ats: Varnish frontend / ATS backend setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [13:11:04] (03PS2) 10Tulsi Bhagat: Add namespace "Aldono" at eo.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505765 (https://phabricator.wikimedia.org/T221525) [13:16:09] PROBLEM - Check systemd state on prometheus1003 is CRITICAL: connect to address 10.64.0.123 port 5666: Connection refused [13:16:54] godog: we seem to have a problem with prometheus1003 [13:16:59] PROBLEM - puppet last run on prometheus1003 is CRITICAL: connect to address 10.64.0.123 port 5666: Connection refused [13:17:03] PROBLEM - DPKG on prometheus1003 is CRITICAL: connect to address 10.64.0.123 port 5666: Connection refused 
[13:17:11] PROBLEM - Disk space on prometheus1003 is CRITICAL: connect to address 10.64.0.123 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [13:17:59] !log Restart nagios-nrpe-server on prometheus1003 [13:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:15] ema: checking, thank you [13:18:27] RECOVERY - Check systemd state on prometheus1003 is OK: OK - running: The system is fully operational [13:19:01] godog: it said nagios-nrpe-server.service: Failed to fork: Cannot allocate memory [13:19:03] RECOVERY - DPKG on prometheus1003 is OK: All packages OK [13:19:11] RECOVERY - Disk space on prometheus1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [13:19:21] indeed, looks like a whole lot of memory allocated [13:19:30] my hunch would be a heavy query [13:19:38] very heavy :) [13:19:51] very very heavy :) [13:20:08] like 80G heavy? [13:20:30] I was checking kafka metrics, used 30d selector (but ~10 mins ago), not sure if I am the culprit [13:20:53] ema: very very very heavy [13:21:54] elukey: on which dashboard? 
[13:22:01] RECOVERY - puppet last run on prometheus1003 is OK: OK: Puppet is currently enabled, last run 21 minutes ago with 0 failures [13:23:09] but yeah IME https://grafana.wikimedia.org/d/000000027/kafka can issue heavy queries [13:24:21] (03PS1) 10Muehlenhoff: Puppetise kadm5.acl [puppet] - 10https://gerrit.wikimedia.org/r/505771 [13:25:09] godog: kafka dashboard [13:26:31] yeah that's probably it [13:27:02] :( sorry [13:28:30] (03CR) 10Effie Mouzeli: [C: 03+1] "LGTM https://puppet-compiler.wmflabs.org/compiler1002/15943/mc1021.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/500415 (owner: 10Muehlenhoff) [13:28:34] (03CR) 10Effie Mouzeli: [C: 03+2] redis::instance: Remove support for Ubuntu/Upstart [puppet] - 10https://gerrit.wikimedia.org/r/500415 (owner: 10Muehlenhoff) [13:28:42] (03PS4) 10Effie Mouzeli: redis::instance: Remove support for Ubuntu/Upstart [puppet] - 10https://gerrit.wikimedia.org/r/500415 (owner: 10Muehlenhoff) [13:28:43] not your fault elukey ! ideally dashboards don't have pitfalls like that [13:31:25] (03CR) 10Alex Monk: [C: 03+2] acme_chief: Prevalidate CN/SNI list [software/acme-chief] - 10https://gerrit.wikimedia.org/r/504512 (https://phabricator.wikimedia.org/T220518) (owner: 10Vgutierrez) [13:32:37] wikilove to elukey [13:32:56] (03PS2) 10Filippo Giunchedi: WIP elastalert: enable on logstash1007 [puppet] - 10https://gerrit.wikimedia.org/r/505762 (https://phabricator.wikimedia.org/T213933) [13:33:08] godog: can we do something about it? Like breaking down the dashboard in multiple ones? 
[13:33:27] (03CR) 10jerkins-bot: [V: 04-1] WIP elastalert: enable on logstash1007 [puppet] - 10https://gerrit.wikimedia.org/r/505762 (https://phabricator.wikimedia.org/T213933) (owner: 10Filippo Giunchedi) [13:34:01] (03CR) 10Elukey: [C: 03+1] Puppetise kadm5.acl [puppet] - 10https://gerrit.wikimedia.org/r/505771 (owner: 10Muehlenhoff) [13:34:27] 10Operations, 10Traffic: Puppet broken on two VMs in the 'traffic' project - https://phabricator.wikimedia.org/T221454 (10ema) p:05Triage→03Normal [13:35:34] (03PS3) 10Filippo Giunchedi: WIP elastalert: enable on logstash1007 [puppet] - 10https://gerrit.wikimedia.org/r/505762 (https://phabricator.wikimedia.org/T213933) [13:36:38] 10Operations, 10Traffic: Puppet broken on two VMs in the 'traffic' project - https://phabricator.wikimedia.org/T221454 (10ema) 05Open→03Resolved Fixed the former, deleted the latter. Thanks for the reminder! [13:37:03] elukey: the easiest win I think is probably to use recording rules for the expensive queries [13:37:38] to your question, yes also having less data per dashboard would help [13:38:07] ack so I'll open a task to work on it [13:38:16] (03PS1) 10Alex Monk: deployment-prep: Remove old jessie hosts [software/swift-ring] - 10https://gerrit.wikimedia.org/r/505773 [13:43:15] PROBLEM - puppet last run on lvs5003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:44:07] (03PS4) 10Filippo Giunchedi: elastalert: enable on logstash1007 [puppet] - 10https://gerrit.wikimedia.org/r/505762 (https://phabricator.wikimedia.org/T213933) [13:46:00] (03CR) 10CDanis: [V: 03+2 C: 03+2] deployment-prep: Change jessie storage hosts to weight 0 [software/swift-ring] - 10https://gerrit.wikimedia.org/r/505725 (owner: 10Alex Monk) [13:47:36] (03PS2) 10Muehlenhoff: Puppetise kadm5.acl [puppet] - 10https://gerrit.wikimedia.org/r/505771 [13:48:57] (03CR) 10Alexandros Kosiaris: [C: 03+1] thumbor: Use port 8800 for haproxy [puppet] - 10https://gerrit.wikimedia.org/r/505759 (https://phabricator.wikimedia.org/T187765) (owner: 10Effie Mouzeli) [13:49:35] (03CR) 10Muehlenhoff: [C: 03+2] Puppetise kadm5.acl [puppet] - 10https://gerrit.wikimedia.org/r/505771 (owner: 10Muehlenhoff) [13:50:21] (03CR) 10Alex Monk: "This is live now" [software/swift-ring] - 10https://gerrit.wikimedia.org/r/505773 (owner: 10Alex Monk) [13:51:39] 10Operations, 10DBA, 10Jade, 10TechCom-RFC, and 3 others: Introduce a new namespace for collaborative judgements about wiki entities - https://phabricator.wikimedia.org/T200297 (10jcrespo) @Harej My question is more like, is the summary still accurate about the result of the conversations? (e.g. rampup of... 
[13:54:49] !log depool cp4021 and reimage as upload_ats T219967 [13:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:54] T219967: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 [13:55:25] (03PS2) 10Ema: cache: reimage cp4021 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/505767 (https://phabricator.wikimedia.org/T219967) [13:56:14] (03CR) 10Ema: [C: 03+2] cache: reimage cp4021 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/505767 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [13:58:16] 10Operations: Broken puppet in the 'logging' project - https://phabricator.wikimedia.org/T221450 (10herron) [13:58:30] 10Operations, 10Performance-Team, 10monitoring, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837 (10jijiki) @Krinkle I have stopped and disabled xenon-log, excimer-log, and apache on mwlog* servers, and I have removed the arclamp-generate-... [13:59:03] 10Operations: Broken puppet in the 'logging' project - https://phabricator.wikimedia.org/T221450 (10herron) 05Open→03Resolved a:03herron Deleted the remaining instances [13:59:50] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp4021.ulsfo.wmnet'] ` The log can be... 
[14:07:11] (03CR) 10Filippo Giunchedi: elastalert: new module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502773 (https://phabricator.wikimedia.org/T213933) (owner: 10Filippo Giunchedi) [14:07:17] !log Disable puppet on thumbor* to merge 505759 [14:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:52] (03CR) 10Effie Mouzeli: [C: 03+2] thumbor: Use port 8800 for haproxy [puppet] - 10https://gerrit.wikimedia.org/r/505759 (https://phabricator.wikimedia.org/T187765) (owner: 10Effie Mouzeli) [14:09:07] (03PS3) 10Effie Mouzeli: thumbor: Use port 8800 for haproxy [puppet] - 10https://gerrit.wikimedia.org/r/505759 (https://phabricator.wikimedia.org/T187765) [14:13:15] jbond42, moritzm, there are four VMs in the 'puppet' project that are failing their puppet runs. Usually I ignore things in that project but I'm trying to shut down the nameserver that they rely on so today it matters :) jbond-buster, jbond-jessie, jbond-puppet-client, jmm-buster, puppet-jmm-pmaster-client [14:13:38] lmk if you're able to fix/delete things or if I need to make an attempt myself :) [14:13:54] having a look [14:13:54] andrewbogott: they should all be fixed [14:14:17] jbond42_: by 'should' you mean you think they're already fixed? 
[14:14:27] * andrewbogott rechecks [14:14:28] yes this morning just checking now [14:14:34] !log Depool thumbor1001 for 505759 and pool back - T187765 [14:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:39] T187765: Replace the Nginx fronting Thumbor with a reverse proxy capable of queuing requests - https://phabricator.wikimedia.org/T187765 [14:14:53] yep, puppet-jmm-pmaster-client just completed a puppet run [14:14:57] RECOVERY - puppet last run on lvs5003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:15:06] andrewbogott: yes they look good, tried to ping you earlier but you were offline [14:15:21] ah, ok — sorry, IRC kicked me out of a bunch of channels [14:15:29] no problem [14:15:59] jbond42_: looks like puppet is running properly on jbond-jessie but it's not picking up recent changes. Without looking I would guess that that means it's using a local puppetmaster and that puppetmaster has local changes so it can't be automatically rebased? [14:16:25] ahh yes one second i will sync my standalone puppet master [14:16:28] 10Operations, 10Analytics, 10Analytics-Cluster: furud - DISK CRITICAL - /mnt/hdfs is not accessible: Input/output error - https://phabricator.wikimedia.org/T221483 (10Ottomata) a:05Dzahn→03Ottomata Thanks! We should probably unpuppetize the Hadoop part of these nodes and unmount /mnt/hdfs until we need... [14:16:39] !log Depool thumbor2001 for 505759 and pool back - T187765 [14:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:50] 10Operations, 10Analytics, 10Analytics-Cluster: furud - DISK CRITICAL - /mnt/hdfs is not accessible: Input/output error - https://phabricator.wikimedia.org/T221483 (10Ottomata) a:05Ottomata→03Dzahn [14:17:00] godog, hey, would you like your home files from deployment-ms-be0[34] rescued before I shut them down?
looks like you just have some old packages and a prometheus script there [14:17:20] 10Operations, 10Analytics, 10Analytics-Cluster: Remove Hadoop configs and unmount /mnt/hdfs from unused backup hosts (furud, +) - https://phabricator.wikimedia.org/T221629 (10Ottomata) [14:17:48] (03PS1) 10Muehlenhoff: Kerberos: Tighten supported_enctypes [puppet] - 10https://gerrit.wikimedia.org/r/505777 [14:18:45] I expect to leave them shut down for a week or two before deletion, suspect those files don't see much use thoughg [14:21:39] !log Depool thumbor1002 for 505759 and pool back - T187765 [14:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:44] T187765: Replace the Nginx fronting Thumbor with a reverse proxy capable of queuing requests - https://phabricator.wikimedia.org/T187765 [14:23:55] (03CR) 10Elukey: [C: 03+1] Kerberos: Tighten supported_enctypes [puppet] - 10https://gerrit.wikimedia.org/r/505777 (owner: 10Muehlenhoff) [14:25:12] jbond42_: did you look at the -jmm- instances too or just the jbond- ones? [14:25:38] andrewbogott: my instances are all fine as well [14:25:56] hm, cumin can't reach one of them [14:25:56] "puppet-jmm-kernel-stretch2.puppet.eqiad.wmflabs": "Permission denied (publickey).", [14:26:05] looking [14:26:19] Krenair: thanks for the heads up! 
delete away [14:26:31] great [14:26:34] (03PS1) 10Alex Monk: deployment-prep: Delete hieradata from deleted instances [puppet] - 10https://gerrit.wikimedia.org/r/505779 [14:26:36] and puppet-jmm-pmaster-client isn't getting the updated nameservers so there must be a rebase issue there [14:27:37] !log Depool thumbor2002 for 505759 and pool back - T187765 [14:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:43] T187765: Replace the Nginx fronting Thumbor with a reverse proxy capable of queuing requests - https://phabricator.wikimedia.org/T187765 [14:28:01] (03PS1) 10Vgutierrez: trafficserver: wrap TLS settings using a type alias [puppet] - 10https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594) [14:28:08] andrewbogott: puppet-jmm-kernel-stretch2.puppet.eqiad.wmflab is fixed now, looking into the git sync for puppet-jmm-pmaster-client next [14:28:22] (03CR) 10Alexandros Kosiaris: [C: 03+2] deployment-prep: Delete hieradata from deleted instances [puppet] - 10https://gerrit.wikimedia.org/r/505779 (owner: 10Alex Monk) [14:28:27] moritzm: thank you! [14:29:12] (03PS2) 10Alexandros Kosiaris: deployment-prep: Delete hieradata from deleted instances [puppet] - 10https://gerrit.wikimedia.org/r/505779 (owner: 10Alex Monk) [14:32:47] andrewbogott: puppet-jmm-pmaster-client is now also up-to-date, there was a broken local commit on it's puppet master and a malformed Hiera setting all fixed now [14:32:56] moritzm: great, thank you! [14:33:15] 10Operations, 10Puppet, 10puppet-compiler: Frequent puppet failures - https://phabricator.wikimedia.org/T221529 (10herron) Here are related puppet agent and puppetmaster1001 apache logs from a sampling of hosts ` Apr 20 21:38:22 planet1001 puppet-agent[7083]: Using configured environment 'production' Apr 20... 
[14:33:35] (03CR) 10CDanis: [V: 03+2 C: 03+2] deployment-prep: Remove old jessie hosts [software/swift-ring] - 10https://gerrit.wikimedia.org/r/505773 (owner: 10Alex Monk) [14:33:52] cdanis, thanks for all your help with these swift-ring changes! [14:33:57] np! [14:35:19] 10Operations, 10netops: Netbox report to validate network equipment data - https://phabricator.wikimedia.org/T221507 (10herron) p:05Triage→03Normal [14:35:46] (03PS3) 10Jbond: kafka shipper: move kafka rsyslog shipping to base profile [puppet] - 10https://gerrit.wikimedia.org/r/505737 (https://phabricator.wikimedia.org/T220987) [14:35:48] (03PS2) 10Jbond: kafka shipper: add ulogd to kafka forwarding rules [puppet] - 10https://gerrit.wikimedia.org/r/505750 (https://phabricator.wikimedia.org/T220987) [14:35:49] (03PS1) 10Jbond: ulogd logstash: Add rule to parse ulogd ouput to json [puppet] - 10https://gerrit.wikimedia.org/r/505783 (https://phabricator.wikimedia.org/T220987) [14:35:55] (03PS2) 10Vgutierrez: trafficserver: wrap TLS settings using a type alias [puppet] - 10https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594) [14:36:35] 10Operations, 10Performance-Team, 10Thumbor, 10serviceops, 10Patch-For-Review: Build Thumbor packages for buster - https://phabricator.wikimedia.org/T221562 (10elukey) As reference: ` root@install1002:/srv/wikimedia# reprepro lsbycomponent thumbor thumbor | 6.3.2+git20170607-1 | jessie-wikimed... 
[14:36:47] (03CR) 10jerkins-bot: [V: 04-1] trafficserver: wrap TLS settings using a type alias [puppet] - 10https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [14:38:51] (03PS2) 10Jbond: ulogd logstash: Add rule to parse ulogd ouput to json [puppet] - 10https://gerrit.wikimedia.org/r/505783 (https://phabricator.wikimedia.org/T220987) [14:39:22] (03CR) 10jerkins-bot: [V: 04-1] ulogd logstash: Add rule to parse ulogd ouput to json [puppet] - 10https://gerrit.wikimedia.org/r/505783 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond) [14:40:31] (03PS3) 10Vgutierrez: trafficserver: wrap TLS settings using a type alias [puppet] - 10https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594) [14:43:12] (03PS4) 10Vgutierrez: trafficserver: wrap TLS settings using a type alias [puppet] - 10https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594) [14:43:15] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Refactor public-facing DYNA scheme for primary project hostnames in our DNS - https://phabricator.wikimedia.org/T208263 (10Cwek) I am not sure if it is influential, but I still have to report it. I'm from mainland China. As we all kno... [14:44:35] (03PS3) 10Jbond: ulogd logstash: Add rule to parse ulogd ouput to json [puppet] - 10https://gerrit.wikimedia.org/r/505783 (https://phabricator.wikimedia.org/T220987) [14:46:10] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/505777 (owner: 10Muehlenhoff) [14:46:40] (03PS2) 10Muehlenhoff: Kerberos: Tighten supported_enctypes [puppet] - 10https://gerrit.wikimedia.org/r/505777 [14:49:10] (03CR) 10Ottomata: "thnx! 
one comment" [puppet] - 10https://gerrit.wikimedia.org/r/505373 (https://phabricator.wikimedia.org/T220894) (owner: 10Alex Monk) [14:49:43] ottomata, I think gerrit ate your comment :/ [14:51:27] (03CR) 10Muehlenhoff: [C: 03+2] Kerberos: Tighten supported_enctypes [puppet] - 10https://gerrit.wikimedia.org/r/505777 (owner: 10Muehlenhoff) [14:56:31] 10Operations, 10Performance-Team, 10Thumbor, 10serviceops, 10Patch-For-Review: Build Thumbor packages for buster - https://phabricator.wikimedia.org/T221562 (10Gilles) @elukey are the sources for these debian packages in version control somewhere? I ended up packaging the current sid version of Thumbor,... [14:56:49] (03PS1) 10Ottomata: Enable api-request logging to EventGate on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505787 (https://phabricator.wikimedia.org/T214080) [14:57:45] 10Operations, 10ops-codfw: Predictive disk failure on db2047 - https://phabricator.wikimedia.org/T149670 (10jcrespo) (also not the same disk slot, so different issues and should be tracked separately) [14:59:48] (03PS1) 10Alexandros Kosiaris: First draft of a wikibase-termbox chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/505788 (https://phabricator.wikimedia.org/T220402) [15:00:04] matthiasmullie: (Dis)respected human, time to deploy Structured Data on Commons Phase II deployment ("depicts") (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190423T1500). Please do the needful. [15:00:33] yep [15:01:14] (03PS15) 10Matthias Mullie: SDC: Enable Depicts functionality on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498145 (https://phabricator.wikimedia.org/T218913) (owner: 10Jforrester) [15:03:16] 10Operations, 10Performance-Team, 10Thumbor, 10serviceops, 10Patch-For-Review: Build Thumbor packages for buster - https://phabricator.wikimedia.org/T221562 (10MoritzMuehlenhoff) >>! 
In T221562#5130813, @Gilles wrote: > @elukey are the sources for these debian packages in version control somewhere? I end... [15:03:32] (03CR) 10Matthias Mullie: [C: 03+2] SDC: Enable Depicts functionality on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498145 (https://phabricator.wikimedia.org/T218913) (owner: 10Jforrester) [15:04:26] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4021.ulsfo.wmnet'] ` and were **ALL** successful. [15:04:36] 10Operations, 10Performance-Team, 10Thumbor, 10serviceops, 10Patch-For-Review: Build Thumbor packages for buster - https://phabricator.wikimedia.org/T221562 (10Gilles) That makes sense. I'll try that for the thumbor package, I think it's best to decouple the thumbor version upgrade from the Buster upgrade. [15:04:51] (03Merged) 10jenkins-bot: SDC: Enable Depicts functionality on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498145 (https://phabricator.wikimedia.org/T218913) (owner: 10Jforrester) [15:06:16] It did eat my comment... [15:06:17] ? 
[15:06:33] (03CR) 10jenkins-bot: SDC: Enable Depicts functionality on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498145 (https://phabricator.wikimedia.org/T218913) (owner: 10Jforrester) [15:07:59] (03CR) 10Ottomata: network::constants: Move various analytics special_hosts to hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/505373 (https://phabricator.wikimedia.org/T220894) (owner: 10Alex Monk) [15:09:08] ottomata, good idea [15:15:44] (03PS1) 10Ema: cache: add ATS hiera settings to role upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/505789 (https://phabricator.wikimedia.org/T219967) [15:18:03] (03CR) 10Ema: [C: 03+2] cache: add ATS hiera settings to role upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/505789 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [15:18:22] 10Operations, 10Core Platform Team, 10DBA, 10MediaWiki-Database, and 4 others: Special:Log on commons -- entire web request took longer than 60 seconds and timed out - https://phabricator.wikimedia.org/T221458 (10Marostegui) This is now waiting on reviewers for the patchset at T221458#5130519 https://gerri... 
[15:24:09] (03PS5) 10Vgutierrez: trafficserver: wrap TLS settings using a type alias [puppet] - 10https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594) [15:24:36] (03PS1) 10Alex Monk: base::firewall: Send all special host groups via single parameter [puppet] - 10https://gerrit.wikimedia.org/r/505793 [15:25:06] (03CR) 10jerkins-bot: [V: 04-1] base::firewall: Send all special host groups via single parameter [puppet] - 10https://gerrit.wikimedia.org/r/505793 (owner: 10Alex Monk) [15:25:11] (03CR) 10Alex Monk: network::constants: Move various analytics special_hosts to hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/505373 (https://phabricator.wikimedia.org/T220894) (owner: 10Alex Monk) [15:26:01] 10Operations, 10Performance-Team, 10Thumbor, 10serviceops, 10Patch-For-Review: Build Thumbor packages for buster - https://phabricator.wikimedia.org/T221562 (10elukey) Me or @jijiki can pick the deb sources (with patches if needed) and rebuild them for buster-wikimedia on boron, and then upload them. Bas... [15:28:33] !log mlitn@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SDC: Enable Depicts functionality on Commons (duration: 00m 54s) [15:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:23] (03CR) 10Bstorm: "I don't want mitaka. The versioning is meaningless for mitaka/newton on stretch for the clientpackages. 
I just want the required python3" [puppet] - 10https://gerrit.wikimedia.org/r/505339 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [15:35:51] (03PS6) 10Vgutierrez: trafficserver: wrap TLS settings using a type alias [puppet] - 10https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594) [15:37:32] 10Operations, 10Performance-Team, 10Thumbor, 10serviceops, 10Patch-For-Review: Build Thumbor packages for buster - https://phabricator.wikimedia.org/T221562 (10jijiki) I am happy to do it, but: I am afraid it will have to wait for next week, as this is a short week for me:) [15:38:44] PROBLEM - tools project instance distribution on cloudcontrol1003 is CRITICAL: CRITICAL: sgebastion class instances not spread out enough https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:41:36] (03PS1) 10Ema: trafficserver: run apt-get update before installing [puppet] - 10https://gerrit.wikimedia.org/r/505799 (https://phabricator.wikimedia.org/T219967) [15:42:40] (03PS1) 10Paladox: Merge tag 'v2.15.13' into wmf/stable-2.15 [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/505801 [15:45:11] (03CR) 10Bstorm: cloudstore: add python3 clientpackages for stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/505339 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [15:48:12] (03CR) 10Ema: [C: 03+2] trafficserver: run apt-get update before installing [puppet] - 10https://gerrit.wikimedia.org/r/505799 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [15:48:54] (03PS2) 10Alex Monk: base::firewall: Send all special host groups via single parameter [puppet] - 10https://gerrit.wikimedia.org/r/505793 [15:52:07] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10User-Smalyshev, 10cloud-services-team (Kanban): Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (10bd808) [15:54:02] 
Reedy: thanks for sticking around! [15:54:16] no problem :) [15:54:20] Didn't have to do very much ;) [15:54:58] those are my favorite days! [15:55:45] (03PS1) 10Reedy: Set sqwikiquote $wgLocalTimezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505804 (https://phabricator.wikimedia.org/T221627) [15:58:41] (03PS1) 10Ema: trafficserver: avoid apt dependency cycle [puppet] - 10https://gerrit.wikimedia.org/r/505805 (https://phabricator.wikimedia.org/T219967) [15:59:08] (03CR) 10Paladox: "Not needed now, done in https://gerrit.wikimedia.org/r/#/c/operations/software/gerrit/+/505801/-1..1" [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/502487 (owner: 10DCausse) [16:00:04] godog and _joe_: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190423T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:01:37] (03CR) 10Ema: [C: 03+2] trafficserver: avoid apt dependency cycle [puppet] - 10https://gerrit.wikimedia.org/r/505805 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [16:04:32] (03CR) 10Reedy: [C: 03+2] Set sqwikiquote $wgLocalTimezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505804 (https://phabricator.wikimedia.org/T221627) (owner: 10Reedy) [16:05:07] (03CR) 10Herron: [C: 03+1] "I like it!" 
[puppet] - 10https://gerrit.wikimedia.org/r/505741 (https://phabricator.wikimedia.org/T220103) (owner: 10Filippo Giunchedi) [16:05:15] (03PS1) 10Brion VIBBER: Update brion's production key for new laptop [puppet] - 10https://gerrit.wikimedia.org/r/505807 [16:05:37] (03Merged) 10jenkins-bot: Set sqwikiquote $wgLocalTimezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505804 (https://phabricator.wikimedia.org/T221627) (owner: 10Reedy) [16:05:51] (03CR) 10jenkins-bot: Set sqwikiquote $wgLocalTimezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505804 (https://phabricator.wikimedia.org/T221627) (owner: 10Reedy) [16:07:03] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: set wglocaltimezone for sqwikiquote T221627 (duration: 00m 54s) [16:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:16] (03CR) 10Jbond: [C: 04-1] "i think we should re think how this is done see comments" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/504968 (owner: 10Bstorm) [16:07:28] T221627: sq.wikiquote.org shows different time when you're logged in and when you're logged out - https://phabricator.wikimedia.org/T221627 [16:08:30] 10Operations, 10ops-codfw, 10decommission: decommission: labtestservices2001.wikimedia.org - https://phabricator.wikimedia.org/T218022 (10RobH) [16:08:48] (03PS4) 10Bstorm: cloudstore: add python3 clientpackages for all [puppet] - 10https://gerrit.wikimedia.org/r/505339 (https://phabricator.wikimedia.org/T209527) [16:09:21] (03CR) 10Herron: elastalert: new module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502773 (https://phabricator.wikimedia.org/T213933) (owner: 10Filippo Giunchedi) [16:12:08] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/505807 (owner: 10Brion VIBBER) [16:12:11] (03PS2) 10Herron: phabricator: remove rfc1918 ip4 addrs from SPF record [dns] - 10https://gerrit.wikimedia.org/r/504936 
(https://phabricator.wikimedia.org/T221288) [16:12:32] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [16:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:35] whee [16:12:39] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [16:12:41] (03PS3) 10Herron: phabricator: remove rfc1918 ip4 addrs from SPF record [dns] - 10https://gerrit.wikimedia.org/r/504936 (https://phabricator.wikimedia.org/T221288) [16:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:44] 10Operations, 10ops-codfw, 10decommission: decommission: labtestservices2001.wikimedia.org - https://phabricator.wikimedia.org/T218022 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `labtestservices2001.wikimedia.org` - labtestservices2001.wikimedia.org - Rem... [16:13:15] 10Operations, 10Security-Team: Add jfishback to security@ alias in exim - https://phabricator.wikimedia.org/T221661 (10JFishback_WMF) [16:13:16] PROBLEM - Host 208.80.153.51 is DOWN: PING CRITICAL - Packet loss = 100% [16:13:21] 10Operations, 10ops-codfw, 10decommission: decommission: labtestservices2001.wikimedia.org - https://phabricator.wikimedia.org/T218022 (10RobH) [16:13:41] (03CR) 10Herron: "> I re-dug through the history on this and reconfirmed everything," [dns] - 10https://gerrit.wikimedia.org/r/504936 (https://phabricator.wikimedia.org/T221288) (owner: 10Herron) [16:14:32] (03PS1) 10RobH: decommission labtestservices2001 production dns [dns] - 10https://gerrit.wikimedia.org/r/505810 (https://phabricator.wikimedia.org/T218022) [16:16:18] (03PS1) 10RobH: decom labtestservices2001 [puppet] - 10https://gerrit.wikimedia.org/r/505812 (https://phabricator.wikimedia.org/T218022) [16:16:22] 10Operations, 10Security-Team: Add jfishback to security@ alias in exim - https://phabricator.wikimedia.org/T221661 (10sbassett) p:05Triage→03Normal [16:17:02] 10Operations, 
10fundraising-tech-ops, 10netops, 10Patch-For-Review: Revoke production prometheus fundraising access - https://phabricator.wikimedia.org/T217355 (10cwdent) a:03cwdent [16:17:13] (03CR) 10RobH: [C: 03+2] decommission labtestservices2001 production dns [dns] - 10https://gerrit.wikimedia.org/r/505810 (https://phabricator.wikimedia.org/T218022) (owner: 10RobH) [16:17:20] 10Operations, 10SRE-Access-Requests, 10monitoring, 10Patch-For-Review: Allow Bryan Davis to downtime alerts in Icinga - https://phabricator.wikimedia.org/T220887 (10herron) 05Open→03Resolved Looks like this is complete, but if any follow up is needed please don't hesitate to re-open! [16:17:30] (03PS1) 10Lucas Werkmeister (WMDE): Add WikibaseSchema to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505813 (https://phabricator.wikimedia.org/T221650) [16:17:38] (03CR) 10RobH: [C: 03+2] decom labtestservices2001 [puppet] - 10https://gerrit.wikimedia.org/r/505812 (https://phabricator.wikimedia.org/T218022) (owner: 10RobH) [16:18:11] (03CR) 10Hashar: "I have crafted this change in order to not forget about doing the configuration change. 
Indeed we should not have merged that over Easter" [puppet] - 10https://gerrit.wikimedia.org/r/504973 (https://phabricator.wikimedia.org/T182756) (owner: 10Hashar) [16:18:33] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 35.71% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:20:52] (03PS1) 10Ema: cache: distinguish between upload and upload_ats nodes [puppet] - 10https://gerrit.wikimedia.org/r/505815 (https://phabricator.wikimedia.org/T219967) [16:21:32] 10Operations, 10ops-codfw, 10decommission: decommission: labtestservices2001.wikimedia.org - https://phabricator.wikimedia.org/T218022 (10RobH) [16:21:58] 10Operations, 10ops-codfw, 10decommission: decommission: labtestservices2001.wikimedia.org - https://phabricator.wikimedia.org/T218022 (10RobH) a:05RobH→03Papaul Ready for the remainder of decom steps, then removal from racks, thanks! [16:22:32] 10Operations, 10SRE-Access-Requests: offboard tilman bayer - https://phabricator.wikimedia.org/T220565 (10herron) 05Open→03Resolved a:03herron Resolving as checklist in description has been completed [16:22:35] PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [16:23:44] (03CR) 10Alex Monk: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/505793 (owner: 10Alex Monk) [16:24:01] (03CR) 10ArielGlenn: "Verified on hangout that it's jis key" [puppet] - 10https://gerrit.wikimedia.org/r/505807 (owner: 10Brion VIBBER) [16:24:14] (03PS2) 10ArielGlenn: Update brion's production key for new laptop [puppet] - 10https://gerrit.wikimedia.org/r/505807 (owner: 10Brion VIBBER) [16:25:14] (03CR) 10ArielGlenn: [C: 03+2] Update brion's production key for new laptop [puppet] - 10https://gerrit.wikimedia.org/r/505807 
(owner: 10Brion VIBBER) [16:25:27] (03Abandoned) 10DCausse: Add a new extension point SshExecuteCommandInterceptor [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/502487 (owner: 10DCausse) [16:26:08] (03PS1) 10Lucas Werkmeister (WMDE): Define wmgUseWikibaseSchema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505816 (https://phabricator.wikimedia.org/T221651) [16:26:15] (03PS1) 10Jbond: kafka: It was pointed out that kafak shipping may not work for all hosts [puppet] - 10https://gerrit.wikimedia.org/r/505817 (https://phabricator.wikimedia.org/T220987) [16:26:49] RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [16:26:51] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:27:32] (03CR) 10Ottomata: base::firewall: Send all special host groups via single parameter (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/505793 (owner: 10Alex Monk) [16:29:31] checking the memcached errors [16:30:04] ottomata and Pchelolo: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) EventGate Analytics deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190423T1630). 
[16:30:17] (03PS1) 10Herron: add jfishback to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/505818 (https://phabricator.wikimedia.org/T221660) [16:30:30] again mc1029's tx bandwidth saturation [16:30:33] (03PS6) 10Filippo Giunchedi: elastalert: new module [puppet] - 10https://gerrit.wikimedia.org/r/502773 (https://phabricator.wikimedia.org/T213933) [16:30:34] (03PS5) 10Filippo Giunchedi: elastalert: enable on logstash1007 [puppet] - 10https://gerrit.wikimedia.org/r/505762 (https://phabricator.wikimedia.org/T213933) [16:31:13] !log added jfishback to wmf ldap group T221660 [16:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:18] T221660: WMF LDAP Access for jfishback - https://phabricator.wikimedia.org/T221660 [16:32:11] 10Operations, 10DC-Ops, 10Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10wiki_willy) @Dzahn - here's the public key info you requested below: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDWLuC4OYjtqt3meBmMk9aipLH4h3xiFxaGgY1iy2ZYKRD/+bHnvGkrsfTePV+1qENBv8Hn6BahmISN2OMr9VzIv4... 
[16:32:41] (03CR) 10Herron: [C: 03+2] add jfishback to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/505818 (https://phabricator.wikimedia.org/T221660) (owner: 10Herron) [16:32:43] (03PS3) 10Alex Monk: base::firewall: Send all special host groups via single parameter [puppet] - 10https://gerrit.wikimedia.org/r/505793 [16:33:21] (03CR) 10jerkins-bot: [V: 04-1] base::firewall: Send all special host groups via single parameter [puppet] - 10https://gerrit.wikimedia.org/r/505793 (owner: 10Alex Monk) [16:33:24] !proceeding to enable api-request eventgate-analytics logging for all wikis [16:33:40] !log proceeding to enable api-request eventgate-analytics logging for all wikis [16:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:50] (03CR) 10Ottomata: [C: 03+2] Enable api-request logging to EventGate on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505787 (https://phabricator.wikimedia.org/T214080) (owner: 10Ottomata) [16:34:01] (03PS2) 10Ottomata: Enable api-request logging to EventGate on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505787 (https://phabricator.wikimedia.org/T214080) [16:34:27] 10Operations, 10fundraising-tech-ops, 10netops: Network setup for frmon2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T221475 (10cwdent) commit 606f45371334528bbbd51a4daa17805f1fddd7e4 (HEAD -> master, origin/master, origin/HEAD) Author: Casey Dentinger Da... [16:34:40] 10Operations, 10Performance-Team, 10Thumbor, 10serviceops, 10Patch-For-Review: Build Thumbor packages for buster - https://phabricator.wikimedia.org/T221562 (10Gilles) I need to find a working solution on labs first. Currently either I use thumbor-6.3.2+git20170607 and it has a bunch of test failures (pa... 
[16:38:20] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable api-request logging to eventgate-analytics for all wikis - T214080 (duration: 00m 53s) [16:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:25] T214080: Rewrite Avro schemas (ApiAction, CirrusSearchRequestSet) as JSONSchema and produce to EventGate - https://phabricator.wikimedia.org/T214080 [16:39:25] (03CR) 10jenkins-bot: Enable api-request logging to EventGate on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505787 (https://phabricator.wikimedia.org/T214080) (owner: 10Ottomata) [16:39:57] !log Depool thumbor1003 for 505759 and pool back - T187765 [16:40:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:04] T187765: Replace the Nginx fronting Thumbor with a reverse proxy capable of queuing requests - https://phabricator.wikimedia.org/T187765 [16:40:27] (03CR) 10Filippo Giunchedi: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/505741 (https://phabricator.wikimedia.org/T220103) (owner: 10Filippo Giunchedi) [16:40:39] PROBLEM - High lag on wdqs1003 is CRITICAL: 3731 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:41:38] (03PS4) 10Alex Monk: base::firewall: Send all special host groups via single parameter [puppet] - 10https://gerrit.wikimedia.org/r/505793 [16:42:01] (03PS6) 10Filippo Giunchedi: elastalert: enable on logstash1007 [puppet] - 10https://gerrit.wikimedia.org/r/505762 (https://phabricator.wikimedia.org/T213933) [16:42:08] (03PS2) 10Ema: cache: distinguish between upload and upload_ats nodes [puppet] - 10https://gerrit.wikimedia.org/r/505815 (https://phabricator.wikimedia.org/T219967) [16:42:15] (03CR) 10jerkins-bot: [V: 04-1] base::firewall: Send all special host groups via single parameter [puppet] - 10https://gerrit.wikimedia.org/r/505793 (owner: 10Alex Monk) [16:43:16] !log Depool 
thumbor2003 for 505759 and pool back - T187765 [16:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:38] (03PS1) 10Gehel: maps: align tilerator CPU usage across all nodes [puppet] - 10https://gerrit.wikimedia.org/r/505819 [16:44:07] (03PS5) 10Alex Monk: base::firewall: Send all special host groups via single parameter [puppet] - 10https://gerrit.wikimedia.org/r/505793 [16:45:41] (03CR) 10Filippo Giunchedi: elastalert: new module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502773 (https://phabricator.wikimedia.org/T213933) (owner: 10Filippo Giunchedi) [16:45:43] (03CR) 10Bstorm: puppetdb: adapt the module so it works on Cloud VPS again (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/504968 (owner: 10Bstorm) [16:46:23] (03PS6) 10Alex Monk: base::firewall: Send all special host groups via single parameter [puppet] - 10https://gerrit.wikimedia.org/r/505793 [16:47:42] (03CR) 10Dzahn: "> until phab1003 is configured for rsync" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/437558 (https://phabricator.wikimedia.org/T196019) (owner: 10Dzahn) [16:48:59] (03PS2) 10Filippo Giunchedi: logstash: ramp up logs retention [puppet] - 10https://gerrit.wikimedia.org/r/505741 (https://phabricator.wikimedia.org/T220103) [16:49:09] (03CR) 10Alex Monk: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/505793 (owner: 10Alex Monk) [16:49:27] !log Depool thumbor1004 for 505759 and pool back - T187765 [16:49:27] (03CR) 10Dzahn: "> Patch Set 9:" [puppet] - 10https://gerrit.wikimedia.org/r/498429 (owner: 10Dzahn) [16:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:32] T187765: Replace the Nginx fronting Thumbor with a reverse proxy capable of queuing requests - https://phabricator.wikimedia.org/T187765 [16:51:03] (03CR) 10Dzahn: "thanks! 
i'm just not sure if this +1 means i should just merge it or if deployment is still needed as usual with grant changes that actual" [puppet] - 10https://gerrit.wikimedia.org/r/496120 (owner: 10Dzahn) [16:51:37] (03CR) 10Ottomata: "BTW, we should def get a second opinion here. This is cleaner and more flexible IMO, but others might disagree and network constants are " (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/505793 (owner: 10Alex Monk) [16:52:28] (03PS1) 10Gehel: maps: smooth the tilerator load by reducing cpu assigned to tilerator [puppet] - 10https://gerrit.wikimedia.org/r/505823 (https://phabricator.wikimedia.org/T221670) [16:53:03] (03CR) 10Alexandros Kosiaris: [C: 03+2] network::constants: Move various analytics special_hosts to hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/505373 (https://phabricator.wikimedia.org/T220894) (owner: 10Alex Monk) [16:54:01] (03CR) 10Alex Monk: base::firewall: Send all special host groups via single parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/505793 (owner: 10Alex Monk) [16:54:50] !log restart wdqs for jvm ugprade [16:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:19] !log Depool thumbor2004 for 505759 and pool back - T187765 [16:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:24] T187765: Replace the Nginx fronting Thumbor with a reverse proxy capable of queuing requests - https://phabricator.wikimedia.org/T187765 [16:56:41] (03CR) 10Alexandros Kosiaris: base::firewall: Send all special host groups via single parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/505793 (owner: 10Alex Monk) [16:59:35] (03CR) 10Jbond: [C: 04-1] puppetdb: adapt the module so it works on Cloud VPS again (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/504968 (owner: 10Bstorm) [17:00:04] cscott, arlolra, subbu, and halfak: #bothumor Q:Why did functions stop calling each other? 
A:They had arguments. Rise for Services – Graphoid / Parsoid / Citoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190423T1700). [17:00:19] (03PS5) 10Dzahn: icinga: remove google safe browsing monitoring [puppet] - 10https://gerrit.wikimedia.org/r/505304 (https://phabricator.wikimedia.org/T216985) [17:00:44] (03CR) 10Dzahn: "> Patch Set 4:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/505304 (https://phabricator.wikimedia.org/T216985) (owner: 10Dzahn) [17:01:03] (03PS6) 10Dzahn: icinga: remove google safe browsing monitoring [puppet] - 10https://gerrit.wikimedia.org/r/505304 (https://phabricator.wikimedia.org/T216985) [17:01:53] (03CR) 10Dzahn: [C: 03+2] icinga: remove google safe browsing monitoring [puppet] - 10https://gerrit.wikimedia.org/r/505304 (https://phabricator.wikimedia.org/T216985) (owner: 10Dzahn) [17:05:01] (03CR) 10Alex Monk: "This module does work quite happily on Cloud VPS, puppet on deployment-puppetdb02 is okay." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/504968 (owner: 10Bstorm) [17:08:04] (03CR) 10Bstorm: puppetdb: adapt the module so it works on Cloud VPS again (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/504968 (owner: 10Bstorm) [17:09:40] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "PCC for cloud HW: https://puppet-compiler.wmflabs.org/compiler1002/15959/" [puppet] - 10https://gerrit.wikimedia.org/r/505339 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [17:18:56] 10Operations, 10Performance-Team, 10Thumbor, 10serviceops, 10Patch-For-Review: Build Thumbor packages for buster - https://phabricator.wikimedia.org/T221562 (10Gilles) I'm started to get an idea of what's happening with 6.5.1 getting stuck on some of our thumbor plugin tests. One of our tests, test_mult... 
[17:19:25] 10Operations, 10Performance-Team, 10Thumbor, 10serviceops, 10Patch-For-Review: Build Thumbor packages for buster - https://phabricator.wikimedia.org/T221562 (10jijiki) @Gilles ping me when you think we are ready to build packages for buster, we would do it soon either way [17:20:40] (03CR) 10Bstorm: "> Patch Set 1:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/504968 (owner: 10Bstorm) [17:22:08] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@78985fb]: Update mobileapps to 6d3a422 (T201382 T217837) [17:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:17] T201382: mobile-html: adjust page margins/padding - https://phabricator.wikimedia.org/T201382 [17:22:17] T217837: [BUG] mobile-html article body has wrong background color - https://phabricator.wikimedia.org/T217837 [17:23:05] (03CR) 10Bstorm: "> Patch Set 1:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/504968 (owner: 10Bstorm) [17:26:14] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@78985fb]: Update mobileapps to 6d3a422 (T201382 T217837) (duration: 04m 06s) [17:26:17] (03PS6) 10Gilles: Buster compatibility [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/505727 (https://phabricator.wikimedia.org/T221562) [17:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:31] 10Operations, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10jijiki) [17:28:01] (03PS2) 10Jbond: kafka: It was pointed out that kafak shipping may not work for all hosts [puppet] - 10https://gerrit.wikimedia.org/r/505817 (https://phabricator.wikimedia.org/T220987) [17:29:54] 10Operations, 10CX-cxserver, 10Citoid, 10Graphoid, and 10 others: Make services swagger specs standard compliant - https://phabricator.wikimedia.org/T218217 (10Mholloway) @MSantos I think we do want to go up to OpenAPI 3; that's actually in progress right now for 
RESTBase and friends (T218218). [17:30:00] (03PS1) 10Alexandros Kosiaris: mariadb::ferm: Switch ferm::rule => ferm::service [puppet] - 10https://gerrit.wikimedia.org/r/505831 [17:30:02] (03PS1) 10Alexandros Kosiaris: Add a dedicated=kask label to kask nodes [puppet] - 10https://gerrit.wikimedia.org/r/505832 (https://phabricator.wikimedia.org/T220821) [17:31:19] (03PS7) 10Gilles: Buster compatibility [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/505727 (https://phabricator.wikimedia.org/T221562) [17:32:28] (03CR) 10Bstorm: [C: 03+1] kafka: It was pointed out that kafak shipping may not work for all hosts [puppet] - 10https://gerrit.wikimedia.org/r/505817 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond) [17:34:14] (03CR) 10Bstorm: puppetdb: adapt the module so it works on Cloud VPS again (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/504968 (owner: 10Bstorm) [17:34:59] (03CR) 10Bstorm: puppetdb: adapt the module so it works on Cloud VPS again (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/504968 (owner: 10Bstorm) [17:35:38] (03PS1) 10Gilles: Remove unused "multi" thumbor handler [puppet] - 10https://gerrit.wikimedia.org/r/505837 (https://phabricator.wikimedia.org/T221562) [17:36:07] (03PS1) 10Effie Mouzeli: Apply -R 200 to memcached on mc1029 [puppet] - 10https://gerrit.wikimedia.org/r/505839 (https://phabricator.wikimedia.org/T208844) [17:37:38] 10Operations, 10MediaWiki-Cache, 10serviceops, 10Patch-For-Review, and 3 others: Apply -R 200 to all the memcached mw object cache instances running in eqiad/codfw - https://phabricator.wikimedia.org/T208844 (10jijiki) [17:38:20] (03CR) 10Effie Mouzeli: [C: 03+2] Apply -R 200 to memcached on mc1029 [puppet] - 10https://gerrit.wikimedia.org/r/505839 (https://phabricator.wikimedia.org/T208844) (owner: 10Effie Mouzeli) [17:39:13] (03CR) 10ArielGlenn: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/437558 (https://phabricator.wikimedia.org/T196019) (owner: 10Dzahn) 
[17:39:35] (03PS2) 10Gilles: Remove unused "multi" thumbor handler [puppet] - 10https://gerrit.wikimedia.org/r/505837 (https://phabricator.wikimedia.org/T221562) [17:39:37] (03CR) 10Jbond: [C: 04-1] puppetdb: adapt the module so it works on Cloud VPS again (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/504968 (owner: 10Bstorm) [17:42:26] (03CR) 10Gilles: "This makes the tests pass with Buster + thumbor 6.5.1:" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/505727 (https://phabricator.wikimedia.org/T221562) (owner: 10Gilles) [17:43:40] !log Restarting memcached on mc1029 - T208844 [17:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:46] T208844: Apply -R 200 to all the memcached mw object cache instances running in eqiad/codfw - https://phabricator.wikimedia.org/T208844 [17:45:21] (03CR) 10Alex Monk: puppetdb: adapt the module so it works on Cloud VPS again (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/504968 (owner: 10Bstorm) [17:46:50] (03CR) 10Jbond: [C: 04-1] puppetdb: adapt the module so it works on Cloud VPS again (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/504968 (owner: 10Bstorm) [17:47:40] (03PS2) 10Bstorm: puppetdb: adapt the module so it works on Cloud VPS simply again [puppet] - 10https://gerrit.wikimedia.org/r/504968 [17:49:30] Hi, gerrit appears to have a high thread count [17:49:34] https://gerrit.wikimedia.org/r/monitoring?part=graph&graph=cpu [17:49:44] Which is causing problems (as reported in -releng) [17:50:34] paladox: yea, there was discussion but unfortunately not in public [17:50:48] ok [17:51:29] Someone should lower the timeout for http threads (from 5 mins which *may* be making the problem worse). 
[17:52:18] https://github.com/wikimedia/puppet/blob/production/modules/gerrit/templates/gerrit.config.erb#L129 [17:52:21] that's the config [17:53:34] (03CR) 10Jbond: puppetdb: adapt the module so it works on Cloud VPS simply again (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/504968 (owner: 10Bstorm) [17:54:29] +1 to lowering the timeout [17:55:11] I think the size of the editquality repo could be part of the problem and I think extra load from codesearch could be another part [17:55:20] oic [17:55:22] (03PS1) 10Paladox: Gerrit: Lower httpd.maxWait to 2mins [puppet] - 10https://gerrit.wikimedia.org/r/505847 [17:55:44] (03PS2) 10Paladox: Gerrit: Lower httpd.maxWait to 2mins [puppet] - 10https://gerrit.wikimedia.org/r/505847 [17:55:50] twentyafterfour ^^ [17:56:17] gerrit seems fast for me .... [17:56:30] yeh [17:56:31] it will [17:56:39] it will be intermittent for certain users [17:57:23] (03CR) 10Marostegui: [C: 03+1] "> thanks! i'm just not sure if this +1 means i should just merge it" [puppet] - 10https://gerrit.wikimedia.org/r/496120 (owner: 10Dzahn) [17:58:06] (03CR) 10Bstorm: puppetdb: adapt the module so it works on Cloud VPS simply again (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/504968 (owner: 10Bstorm) [17:59:29] (03CR) 10Bstorm: puppetdb: adapt the module so it works on Cloud VPS simply again (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/504968 (owner: 10Bstorm) [18:00:33] Do we want a higher thread count for http? Or is it too high?
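Note on the setting discussed above: as far as the Gerrit configuration docs describe it, httpd.maxWait bounds how long a smart-HTTP request may wait for a free worker thread before it is rejected, with a default of 5 minutes. A minimal fragment of what "Lower httpd.maxWait to 2mins" would look like in gerrit.config, assuming the usual git-config syntax; the actual contents of the production template are not shown in this log, so treat this as a sketch rather than the merged change:

```ini
[httpd]
	# Assumed illustration: give up on a queued request after 2 minutes
	# instead of the default 5, so stuck clients free their queue slot sooner.
	maxWait = 2 min
```

Lowering the wait does not add capacity; it only limits how long requests pile up behind busy threads, which matches the "probably makes things a bit better" framing in the later review.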
[18:02:54] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [18:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:00] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [18:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:07] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [18:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:13] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [18:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:17] (03CR) 10Muehlenhoff: Buster compatibility (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/505727 (https://phabricator.wikimedia.org/T221562) (owner: 10Gilles) [18:04:19] (03CR) 10Bstorm: "Puppet compiler says it's a functional noop in the most confusing way possible: https://puppet-compiler.wmflabs.org/compiler1002/15961/" [puppet] - 10https://gerrit.wikimedia.org/r/504968 (owner: 10Bstorm) [18:04:21] (03PS2) 10Alexandros Kosiaris: First draft of a wikibase-termbox chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/505788 (https://phabricator.wikimedia.org/T220402) [18:04:51] thread count is growing again [18:06:25] (03PS1) 10RobH: restbase200[78] decom [puppet] - 10https://gerrit.wikimedia.org/r/505849 (https://phabricator.wikimedia.org/T221134) [18:07:27] (03PS1) 10RobH: restbase200[78] prod dns decom [dns] - 10https://gerrit.wikimedia.org/r/505850 (https://phabricator.wikimedia.org/T221134) [18:07:38] (03CR) 10RobH: [C: 03+2] restbase200[78] decom [puppet] - 10https://gerrit.wikimedia.org/r/505849 (https://phabricator.wikimedia.org/T221134) (owner: 10RobH) [18:08:19] (03CR) 10RobH: [C: 03+2] restbase200[78] prod dns decom [dns] - 10https://gerrit.wikimedia.org/r/505850 (https://phabricator.wikimedia.org/T221134) (owner: 10RobH) [18:09:36] !log depool wdqs1003 
to let it catch up [18:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:43] (03CR) 10Dzahn: "> Patch Set 6:" [puppet] - 10https://gerrit.wikimedia.org/r/496120 (owner: 10Dzahn) [18:09:56] * paladox wonders what is blocking the http threads [18:12:37] uh, i was looking at the wrong graph [18:13:00] https://gerrit.wikimedia.org/r/monitoring?part=graph&graph=activeThreads that being normal looks fine. [18:13:03] (03CR) 10Alex Monk: "that.. looks like a puppet-compiler bug." [puppet] - 10https://gerrit.wikimedia.org/r/504968 (owner: 10Bstorm) [18:16:50] (03CR) 10Jbond: [C: 03+1] "no idea about the compiler issue but latest LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/504968 (owner: 10Bstorm) [18:19:01] (03CR) 10Herron: [C: 04-1] "I think this might be over complicating things. We already are enabling on a case by case basis by including profile::rsyslog::kafka_ship" [puppet] - 10https://gerrit.wikimedia.org/r/505817 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond) [18:21:40] (03PS3) 10Bstorm: puppetdb: adapt the module so it works on Cloud VPS simply again [puppet] - 10https://gerrit.wikimedia.org/r/504968 [18:22:41] (03CR) 10Jbond: "> Patch Set 2: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/505817 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond) [18:24:08] (03PS5) 10Bstorm: cloudstore: add python3 clientpackages for all [puppet] - 10https://gerrit.wikimedia.org/r/505339 (https://phabricator.wikimedia.org/T209527) [18:24:12] * Krinkle merging a wmf.1 patch for deployment, fixing a UBN [18:24:52] (03CR) 10Bstorm: [C: 03+2] cloudstore: add python3 clientpackages for all [puppet] - 10https://gerrit.wikimedia.org/r/505339 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [18:25:42] (03CR) 10Gilles: "It doesn't seem like /etc/os-release is any more helpful:" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/505727 
(https://phabricator.wikimedia.org/T221562) (owner: 10Gilles) [18:26:30] (03CR) 10Gilles: "I've verified that this doesn't break anything on Stretch, it's good to go." [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/505727 (https://phabricator.wikimedia.org/T221562) (owner: 10Gilles) [18:31:02] (03PS7) 10Dzahn: update mariadb grants from phab1002 to phab1003 (comments only) [puppet] - 10https://gerrit.wikimedia.org/r/496120 [18:33:14] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:33:34] PROBLEM - puppet last run on cloudvirt1012 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 3 minutes ago with 6 failures. Failed resources (up to 3 shown): Package[python3-keystoneclient],Package[python3-novaclient],Package[python3-glanceclient],Package[python3-openstackclient] [18:34:08] PROBLEM - puppet last run on cloudvirt1028 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 5 minutes ago with 6 failures. Failed resources (up to 3 shown): Package[python3-keystoneclient],Package[python3-novaclient],Package[python3-glanceclient],Package[python3-openstackclient] [18:35:02] PROBLEM - puppet last run on cloudvirt1026 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 7 minutes ago with 6 failures. Failed resources (up to 3 shown): Package[python3-keystoneclient],Package[python3-novaclient],Package[python3-glanceclient],Package[python3-openstackclient] [18:35:02] PROBLEM - puppet last run on cloudvirt1027 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 7 minutes ago with 6 failures. 
Failed resources (up to 3 shown): Package[python3-keystoneclient],Package[python3-novaclient],Package[python3-glanceclient],Package[python3-openstackclient] [18:35:06] (03PS1) 10RobH: phab1002 decom [puppet] - 10https://gerrit.wikimedia.org/r/505861 (https://phabricator.wikimedia.org/T221391) [18:35:14] (03PS1) 10Andrew Bogott: Pool cloudvirt1005 and 1006 [puppet] - 10https://gerrit.wikimedia.org/r/505862 (https://phabricator.wikimedia.org/T221049) [18:37:36] PROBLEM - puppet last run on cloudvirt1019 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 5 minutes ago with 6 failures. Failed resources (up to 3 shown): Package[python3-keystoneclient],Package[python3-novaclient],Package[python3-glanceclient],Package[python3-openstackclient] [18:38:16] (03CR) 10RobH: [C: 03+2] phab1002 decom [puppet] - 10https://gerrit.wikimedia.org/r/505861 (https://phabricator.wikimedia.org/T221391) (owner: 10RobH) [18:39:12] (03PS2) 10Andrew Bogott: Pool cloudvirt1005 and 1006 [puppet] - 10https://gerrit.wikimedia.org/r/505862 (https://phabricator.wikimedia.org/T221049) [18:39:16] (03PS4) 10Ottomata: New Refine job to refine events using remote JSONSchemas [puppet] - 10https://gerrit.wikimedia.org/r/505287 (https://phabricator.wikimedia.org/T214080) [18:39:54] PROBLEM - puppet last run on cloudvirt1001 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 6 minutes ago with 6 failures. Failed resources (up to 3 shown): Package[python3-keystoneclient],Package[python3-novaclient],Package[python3-glanceclient],Package[python3-openstackclient] [18:40:20] PROBLEM - puppet last run on labpuppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:41:08] PROBLEM - puppet last run on cloudvirt1002 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 11 minutes ago with 6 failures. 
Failed resources (up to 3 shown): Package[python3-keystoneclient],Package[python3-novaclient],Package[python3-glanceclient],Package[python3-openstackclient] [18:41:08] PROBLEM - puppet last run on cloudvirt1008 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 3 minutes ago with 6 failures. Failed resources (up to 3 shown): Package[python3-keystoneclient],Package[python3-novaclient],Package[python3-glanceclient],Package[python3-openstackclient] [18:41:49] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission phab1002/WMF4727 - https://phabricator.wikimedia.org/T221391 (10RobH) a:05RobH→03Cmjohnson [18:41:58] PROBLEM - puppet last run on cloudcontrol1003 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 4 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[python3-openstackclient],Package[python3-designateclient],Package[python3-neutronclient] [18:42:04] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:42:12] PROBLEM - puppet last run on cloudvirt1003 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 7 minutes ago with 6 failures. Failed resources (up to 3 shown): Package[python3-keystoneclient],Package[python3-novaclient],Package[python3-glanceclient],Package[python3-openstackclient] [18:43:02] (03PS5) 10Ottomata: New Refine job to refine events using remote JSONSchemas [puppet] - 10https://gerrit.wikimedia.org/r/505287 (https://phabricator.wikimedia.org/T214080) [18:43:18] PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:43:24] PROBLEM - puppet last run on cloudnet1003 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 3 minutes ago with 6 failures. 
Failed resources (up to 3 shown): Package[python3-keystoneclient],Package[python3-novaclient],Package[python3-glanceclient],Package[python3-openstackclient] [18:43:37] * Krinkle staging on mwdebug1002 [18:43:46] PROBLEM - puppet last run on cloudvirtan1005 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 7 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[python3-openstackclient],Package[python3-designateclient],Package[python3-neutronclient] [18:44:16] PROBLEM - puppet last run on cloudvirt1020 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 5 minutes ago with 6 failures. Failed resources (up to 3 shown): Package[python3-keystoneclient],Package[python3-novaclient],Package[python3-glanceclient],Package[python3-openstackclient] [18:44:18] (03CR) 10Dzahn: [C: 03+2] update mariadb grants from phab1002 to phab1003 (comments only) [puppet] - 10https://gerrit.wikimedia.org/r/496120 (owner: 10Dzahn) [18:44:32] (03PS8) 10Dzahn: update mariadb grants from phab1002 to phab1003 (comments only) [puppet] - 10https://gerrit.wikimedia.org/r/496120 [18:44:53] andrewbogott: just in case - I assume it's fine to stage/deploy mediawiki with scap at this time - RE: the above criticals [18:45:12] PROBLEM - puppet last run on cloudnet1004 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 6 minutes ago with 6 failures. Failed resources (up to 3 shown): Package[python3-keystoneclient],Package[python3-novaclient],Package[python3-glanceclient],Package[python3-openstackclient] [18:45:12] PROBLEM - puppet last run on cloudvirt1004 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 2 minutes ago with 6 failures. 
Failed resources (up to 3 shown): Package[python3-keystoneclient],Package[python3-novaclient],Package[python3-glanceclient],Package[python3-openstackclient] [18:45:31] Krinkle: should be fine, that's just a package dependency thing [18:45:37] thx [18:45:42] (03CR) 10Muehlenhoff: Buster compatibility (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/505727 (https://phabricator.wikimedia.org/T221562) (owner: 10Gilles) [18:46:12] PROBLEM - puppet last run on cloudvirtan1003 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[python3-openstackclient],Package[python3-designateclient],Package[python3-neutronclient] [18:46:44] PROBLEM - puppet last run on cloudvirt1025 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 2 minutes ago with 6 failures. Failed resources (up to 3 shown): Package[python3-keystoneclient],Package[python3-novaclient],Package[python3-glanceclient],Package[python3-openstackclient] [18:47:42] PROBLEM - puppet last run on labnet1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:47:50] PROBLEM - puppet last run on cloudvirt1016 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 22 seconds ago with 3 failures. Failed resources (up to 3 shown): Package[python3-openstackclient],Package[python3-designateclient],Package[python3-neutronclient] [18:48:00] PROBLEM - puppet last run on labnet1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:48:03] (03CR) 10Ottomata: [C: 03+2] "tested new jar, works great! PCC is happy too." [puppet] - 10https://gerrit.wikimedia.org/r/505287 (https://phabricator.wikimedia.org/T214080) (owner: 10Ottomata) [18:48:40] PROBLEM - puppet last run on cloudvirt1005 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 42 seconds ago with 6 failures. 
Failed resources (up to 3 shown): Package[python3-keystoneclient],Package[python3-novaclient],Package[python3-glanceclient],Package[python3-openstackclient] [18:50:29] 10Operations, 10Analytics-Kanban, 10EventBus, 10netops: Allow analytics VLAN to reach schema.svc.$site.wmnet - https://phabricator.wikimedia.org/T221690 (10Ottomata) [18:50:46] quiddity: the patch is now live on mwdebug1002. [18:50:50] PROBLEM - puppet last run on cloudvirt1023 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[python3-openstackclient],Package[python3-designateclient],Package[python3-neutronclient] [18:51:00] PROBLEM - puppet last run on cloudvirt1009 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 4 minutes ago with 6 failures. Failed resources (up to 3 shown): Package[python3-keystoneclient],Package[python3-novaclient],Package[python3-glanceclient],Package[python3-openstackclient] [18:51:07] (03PS9) 10Dzahn: update mariadb grants from phab1002 to phab1003 (comments only) [puppet] - 10https://gerrit.wikimedia.org/r/496120 [18:51:24] PROBLEM - puppet last run on cloudvirt1021 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 4 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[python3-openstackclient],Package[python3-designateclient],Package[python3-neutronclient] [18:51:52] Krinkle, same error https://www.mediawiki.org/wiki/Special:Log/massmessage [18:52:00] (But I'm guessing you're looking at logs) [18:52:04] PROBLEM - puppet last run on cloudvirt1030 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 3 minutes ago with 6 failures. 
Failed resources (up to 3 shown): Package[python3-keystoneclient],Package[python3-novaclient],Package[python3-glanceclient],Package[python3-openstackclient] [18:52:25] 10Operations, 10Security-Team: Add jfishback to security@ alias in exim - https://phabricator.wikimedia.org/T221661 (10chasemp) 05Open→03Resolved Done [18:52:42] PROBLEM - puppet last run on cloudvirt1018 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 3 minutes ago with 6 failures. Failed resources (up to 3 shown): Package[python3-keystoneclient],Package[python3-novaclient],Package[python3-glanceclient],Package[python3-openstackclient] [18:53:14] quiddity: hm.. the sync took 10 seconds longer than after I told you. [18:53:16] Try one more time/ [18:53:52] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [18:53:56] Krinkle, same. [18:54:23] (03PS1) 10Ottomata: Use webproxy for mediawiki_events refine job [puppet] - 10https://gerrit.wikimedia.org/r/505867 (https://phabricator.wikimedia.org/T215442) [18:55:02] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 73162 bytes in 0.143 second response time https://wikitech.wikimedia.org/wiki/Application_servers [18:55:16] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:55:28] quiddity: strange, I don't see an error in the logs. [18:55:50] quiddity: can you try one more this time with "Log" enabled as well? (mwdebug1002, PHP 7, Log, On) [18:56:16] PROBLEM - puppet last run on cloudvirt1013 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 7 minutes ago with 6 failures. Failed resources (up to 3 shown): Package[python3-keystoneclient],Package[python3-novaclient],Package[python3-glanceclient],Package[python3-openstackclient] [18:56:42] PROBLEM - puppet last run on cloudservices1003 is CRITICAL: CRITICAL: Puppet has 3 failures. 
Last run 7 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[python3-openstackclient],Package[python3-designateclient],Package[python3-neutronclient] [18:56:44] Krinkle, done. same error. [18:57:01] thx, yeah, now I have more info to work with. [18:57:08] alright, cancelling the deploy then. [18:57:47] actually, no, I think I know what's going on. I don't see an error because the patch is only applied to the special page request from you, not the job runner server. [18:58:02] PROBLEM - puppet last run on cloudvirtan1002 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 4 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[python3-openstackclient],Package[python3-designateclient],Package[python3-neutronclient] [18:58:10] I'm gonna deploy it anyway as the fatal error did go away, so seems an improvement. and I suspect when it hits the job runner, it'll work as expected. [18:58:12] PROBLEM - High lag on wdqs1003 is CRITICAL: 5428 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [18:58:38] PROBLEM - puppet last run on cloudvirt1017 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[python3-openstackclient],Package[python3-designateclient],Package[python3-neutronclient] [18:58:49] nod. [18:59:06] PROBLEM - puppet last run on cloudvirt1006 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 3 minutes ago with 6 failures. Failed resources (up to 3 shown): Package[python3-keystoneclient],Package[python3-novaclient],Package[python3-glanceclient],Package[python3-openstackclient] [18:59:10] PROBLEM - puppet last run on cloudservices1004 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures.
Failed resources (up to 3 shown): Package[python3-openstackclient],Package[python3-designateclient],Package[python3-neutronclient] [18:59:20] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.1/extensions/MassMessage: c640195 (duration: 00m 56s) [18:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:32] quiddity: alright, one more time. this time with WikimediaDebug 'Off' again. [18:59:38] PROBLEM - puppet last run on labmon1001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[python3-openstackclient],Package[python3-designateclient],Package[python3-neutronclient] [18:59:43] (03PS2) 10Ottomata: Use webproxy for mediawiki_events refine job [puppet] - 10https://gerrit.wikimedia.org/r/505867 (https://phabricator.wikimedia.org/T215442) [19:00:46] Krinkle, Success! [19:00:49] (03CR) 10jerkins-bot: [V: 04-1] Use webproxy for mediawiki_events refine job [puppet] - 10https://gerrit.wikimedia.org/r/505867 (https://phabricator.wikimedia.org/T215442) (owner: 10Ottomata) [19:01:06] quiddity: awesome. Probably the least-confident deploy I have done to date. [19:01:09] But glad my gut was right [19:01:22] PROBLEM - puppet last run on cloudvirt1029 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 5 minutes ago with 6 failures. Failed resources (up to 3 shown): Package[python3-keystoneclient],Package[python3-novaclient],Package[python3-glanceclient],Package[python3-openstackclient] [19:01:34] PROBLEM - puppet last run on labtestvirt2003 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 5 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[python3-openstackclient],Package[python3-designateclient],Package[python3-neutronclient] [19:02:27] Krinkle, so does this mean I'm safe to attempt delivery across all wikis, with TechNews? Or does that need to wait for further updates? 
[19:02:34] PROBLEM - puppet last run on labmon1002 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 6 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[python3-openstackclient],Package[python3-designateclient],Package[python3-neutronclient] [19:02:37] quiddity: nope, go ahead [19:02:41] ty! [19:03:33] (03PS3) 10Ottomata: Use webproxy for mediawiki_events refine job [puppet] - 10https://gerrit.wikimedia.org/r/505867 (https://phabricator.wikimedia.org/T215442) [19:03:38] PROBLEM - puppet last run on labpuppetmaster1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:04:33] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1002/15964/an-coord1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/505867 (https://phabricator.wikimedia.org/T215442) (owner: 10Ottomata) [19:04:42] (03CR) 10jerkins-bot: [V: 04-1] Use webproxy for mediawiki_events refine job [puppet] - 10https://gerrit.wikimedia.org/r/505867 (https://phabricator.wikimedia.org/T215442) (owner: 10Ottomata) [19:06:59] (03PS4) 10Ottomata: Use webproxy for mediawiki_events refine job [puppet] - 10https://gerrit.wikimedia.org/r/505867 (https://phabricator.wikimedia.org/T215442) [19:08:54] (03CR) 10Ottomata: [C: 03+2] Use webproxy for mediawiki_events refine job [puppet] - 10https://gerrit.wikimedia.org/r/505867 (https://phabricator.wikimedia.org/T215442) (owner: 10Ottomata) [19:09:12] (03PS5) 10Ottomata: Use webproxy for mediawiki_events refine job [puppet] - 10https://gerrit.wikimedia.org/r/505867 (https://phabricator.wikimedia.org/T215442) [19:09:14] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Use webproxy for mediawiki_events refine job [puppet] - 10https://gerrit.wikimedia.org/r/505867 (https://phabricator.wikimedia.org/T215442) (owner: 10Ottomata) [19:09:41] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Use webproxy for mediawiki_events refine job [puppet] - 10https://gerrit.wikimedia.org/r/505867 
(https://phabricator.wikimedia.org/T215442) (owner: 10Ottomata) [19:10:15] I am going to go ahead and restart gerrit [19:11:24] !log gerrit restart [19:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:55] cool/ [19:12:11] thcipriani: for future reference what are the thresholds and parameters that you make that choice with? [19:13:08] chaomodus: in this instance, I noticed that there were blocked jvm gc threads in a thread dump. I ran jstack as the gerrit2 user to get that dump. [19:13:41] okay so if gc is in bad shape in general? [19:14:27] (03PS1) 10Muehlenhoff: Include grub::defaults unconditionally [puppet] - 10https://gerrit.wikimedia.org/r/505886 (https://phabricator.wikimedia.org/T140100) [19:14:55] chaomodus: yes, also the thread graph has been a good analog of when gc is in trouble recently https://gerrit.wikimedia.org/r/monitoring?part=graph&graph=activeThreads [19:15:12] (03CR) 10jerkins-bot: [V: 04-1] Include grub::defaults unconditionally [puppet] - 10https://gerrit.wikimedia.org/r/505886 (https://phabricator.wikimedia.org/T140100) (owner: 10Muehlenhoff) [19:15:26] PROBLEM - puppet last run on webperf2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_performance/docroot] [19:15:43] (03CR) 1020after4: [C: 03+1] "This could be helpful, though I am not sure it fixes the issue it probably makes things a bit better." [puppet] - 10https://gerrit.wikimedia.org/r/505847 (owner: 10Paladox) [19:16:34] PROBLEM - puppet last run on webperf1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_performance/docroot] [19:17:52] PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 5 minutes ago with 4 failures. 
Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_statistics_mediawiki],Exec[git_pull_analytics/reportupdater] [19:19:09] thcipriani: awesome thanks! [19:19:14] PROBLEM - puppet last run on cumin2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks] [19:22:31] (03CR) 10Herron: [C: 04-1] "> The alternate approach would be to have an $enable_kafaka_shipping" [puppet] - 10https://gerrit.wikimedia.org/r/505817 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond) [19:25:41] (03PS1) 10Ottomata: Use proper port for schema.svc in refine_job [puppet] - 10https://gerrit.wikimedia.org/r/505887 [19:26:41] (03CR) 10Ottomata: [C: 03+2] Use proper port for schema.svc in refine_job [puppet] - 10https://gerrit.wikimedia.org/r/505887 (owner: 10Ottomata) [19:26:54] (03CR) 10Alex Monk: "right so bastion_hosts actually doesn't get used in the template but is used in the manifest... wonder if it should be left as a separate " [puppet] - 10https://gerrit.wikimedia.org/r/505793 (owner: 10Alex Monk) [19:29:15] 10Operations, 10monitoring, 10Patch-For-Review: google safe browsing icinga checks sporadic UNKNOWN due to 404 - https://phabricator.wikimedia.org/T216985 (10Dzahn) 05Open→03Resolved Check and script etc have been removed from the repo and Icinga web UI. [19:30:40] ACKNOWLEDGEMENT - puppet last run on bast5001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues daniel_zahn filippo prometheus [19:32:32] PROBLEM - Check systemd state on cloudstore1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
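Note on the 19:13 exchange above about deciding when to restart gerrit: the thread dump mentioned there came from running jstack as the gerrit2 user. One rough way to summarise such a dump is to count Java thread states, sketched below. The function name, the sample dump, and the idea of using the BLOCKED count as a signal are assumptions made for illustration; the log does not say how the dump was actually read, and JVM-internal GC threads do not carry a `java.lang.Thread.State` line, so they are not counted here.

```python
import re
from collections import Counter

# Matches the state line jstack prints under each Java thread header, e.g.
#    java.lang.Thread.State: BLOCKED (on object monitor)
STATE_RE = re.compile(r"^\s*java\.lang\.Thread\.State:\s+(\w+)")


def count_thread_states(dump: str) -> Counter:
    """Count Java thread states found in the text of a jstack dump."""
    counts = Counter()
    for line in dump.splitlines():
        match = STATE_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical, heavily shortened dump; a real one would come from something
# like `sudo -u gerrit2 jstack <pid>` with the Gerrit JVM's process id.
SAMPLE_DUMP = '''"HTTP-1" #10 prio=5 tid=0x1 nid=0x1 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
"HTTP-2" #11 prio=5 tid=0x2 nid=0x2 runnable
   java.lang.Thread.State: RUNNABLE
"HTTP-3" #12 prio=5 tid=0x3 nid=0x3 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
'''

if __name__ == "__main__":
    for state, n in count_thread_states(SAMPLE_DUMP).most_common():
        print(f"{state}: {n}")
```

A count like this is only a coarse indicator; comparing it with the activeThreads graph linked in the chat would be the natural cross-check.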
[19:32:46] !log webperf* - running puppet to git pull docroot [19:32:52] 10Operations, 10ops-codfw, 10decommission, 10fundraising-tech-ops, 10Patch-For-Review: decom rigel.frack.codfw.wmnet - https://phabricator.wikimedia.org/T202535 (10RobH) a:03Papaul [19:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:36] PROBLEM - puppet last run on db1125 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 20 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [19:33:54] (03PS1) 10CDanis: codfw decom: halve non-object weights and 2/3rds object weights [software/swift-ring] - 10https://gerrit.wikimedia.org/r/505888 (https://phabricator.wikimedia.org/T221068) [19:34:42] PROBLEM - puppet last run on cloudnet2003-dev is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 5 minutes ago with 6 failures. Failed resources (up to 3 shown): Package[python3-keystoneclient],Package[python3-novaclient],Package[python3-glanceclient],Package[python3-openstackclient] [19:34:46] PROBLEM - puppet last run on cloudnet2002-dev is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 3 minutes ago with 6 failures. Failed resources (up to 3 shown): Package[python3-keystoneclient],Package[python3-novaclient],Package[python3-glanceclient],Package[python3-openstackclient] [19:35:54] RECOVERY - puppet last run on webperf2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:36:40] PROBLEM - puppet last run on cloudstore1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/nsswitch.conf] [19:37:02] RECOVERY - puppet last run on webperf1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:37:13] (03CR) 10CDanis: "cdanis@evebox.local ~/work/gits/swift-ring % make" [software/swift-ring] - 10https://gerrit.wikimedia.org/r/505888 (https://phabricator.wikimedia.org/T221068) (owner: 10CDanis) [19:40:09] (03PS1) 10Bstorm: Revert "cloudstore: add python3 clientpackages for all" [puppet] - 10https://gerrit.wikimedia.org/r/505889 [19:40:22] (03PS2) 10Bstorm: Revert "cloudstore: add python3 clientpackages for all" [puppet] - 10https://gerrit.wikimedia.org/r/505889 [19:41:00] (03CR) 10Bstorm: [C: 03+2] Revert "cloudstore: add python3 clientpackages for all" [puppet] - 10https://gerrit.wikimedia.org/r/505889 (owner: 10Bstorm) [19:43:34] RECOVERY - puppet last run on cloudvirtan1003 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [19:43:36] RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [19:44:06] RECOVERY - puppet last run on db1125 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:44:06] RECOVERY - puppet last run on cloudvirt1025 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [19:45:02] RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:45:08] RECOVERY - puppet last run on cumin2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:45:08] RECOVERY - puppet last run on cloudvirt1016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:45:21] (03PS3) 10Andrew Bogott: Pool cloudvirt1005 and 1006 [puppet] - 10https://gerrit.wikimedia.org/r/505862 (https://phabricator.wikimedia.org/T221049) [19:46:19] (03CR) 10Andrew Bogott: [C: 03+2] Pool cloudvirt1005 and 1006 [puppet] - 
10https://gerrit.wikimedia.org/r/505862 (https://phabricator.wikimedia.org/T221049) (owner: 10Andrew Bogott) [19:47:10] RECOVERY - puppet last run on cloudvirt1023 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [19:47:40] (03PS1) 10Bstorm: cloudstore: set openstack client version to stretch/newton [puppet] - 10https://gerrit.wikimedia.org/r/505892 [19:48:44] RECOVERY - puppet last run on cloudvirt1021 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:49:01] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommmision: labtestweb2001.wikimedia.org - https://phabricator.wikimedia.org/T218024 (10RobH) [19:49:24] RECOVERY - puppet last run on cloudvirt1030 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:49:38] RECOVERY - puppet last run on cloudvirt1013 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [19:50:02] RECOVERY - puppet last run on cloudvirt1018 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [19:50:56] RECOVERY - puppet last run on cloudvirt1009 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:51:22] RECOVERY - puppet last run on cloudvirt1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:52:12] (03PS2) 10Bstorm: cloudstore: set openstack client version to stretch/newton [puppet] - 10https://gerrit.wikimedia.org/r/505892 [19:52:48] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [19:53:04] (03CR) 10Bstorm: [C: 03+2] cloudstore: set openstack client version to stretch/newton [puppet] - 10https://gerrit.wikimedia.org/r/505892 (owner: 10Bstorm) [19:54:00] RECOVERY - puppet last run on cloudservices1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:55:26] RECOVERY - puppet 
last run on cloudvirtan1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:55:32] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommmision: labtestweb2001.wikimedia.org - https://phabricator.wikimedia.org/T218024 (10RobH) Please note that asw-b-codfw has the following interface: ge-5/0/15 up up labcontrol2001 This is act... [19:55:58] RECOVERY - puppet last run on cloudvirt1017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:56:32] RECOVERY - puppet last run on cloudvirt1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:56:36] RECOVERY - puppet last run on cloudservices1004 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [19:56:40] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:57:04] RECOVERY - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:57:12] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [19:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:17] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [19:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:30] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommmision: labtestweb2001.wikimedia.org - https://phabricator.wikimedia.org/T218024 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `labtestweb2001.wikimedia.org... 
[19:58:12] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommmision: labtestweb2001.wikimedia.org - https://phabricator.wikimedia.org/T218024 (10RobH) [19:58:22] RECOVERY - puppet last run on cloudvirt1026 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [19:58:30] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:58:52] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:58:58] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [19:59:16] RECOVERY - puppet last run on labtestvirt2003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:59:20] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:59:20] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [19:59:26] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [19:59:36] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:59:38] RECOVERY - puppet last run on cloudvirt1002 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [19:59:47] hum [19:59:52] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:59:54] RECOVERY - puppet last run on cloudvirt1029 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:59:56] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:00:00] RECOVERY - puppet last run on labmon1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [20:00:28] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:00:46] (03PS1) 10RobH: decommission labtestweb2001 production dns entries [dns] - 10https://gerrit.wikimedia.org/r/505894 (https://phabricator.wikimedia.org/T218024) [20:00:48] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [20:00:50] RECOVERY - puppet last run on cloudvirt1027 
is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [20:01:02] RECOVERY - puppet last run on labpuppetmaster1002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [20:01:06] RECOVERY - puppet last run on cloudnet2003-dev is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [20:01:40] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [20:01:45] (03CR) 10RobH: [C: 03+2] decommission labtestweb2001 production dns entries [dns] - 10https://gerrit.wikimedia.org/r/505894 (https://phabricator.wikimedia.org/T218024) (owner: 10RobH) [20:01:56] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:02:30] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:02:50] RECOVERY - puppet last run on cloudvirt1012 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:03:04] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [20:03:26] RECOVERY - puppet last run on cloudvirt1028 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [20:03:54] RECOVERY - puppet last run on cloudvirt1001 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [20:04:30] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All 
metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:05:24] (03PS1) 10RobH: decom of labtestweb2001 [puppet] - 10https://gerrit.wikimedia.org/r/505895 (https://phabricator.wikimedia.org/T218024) [20:05:36] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:05:44] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:05:52] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [20:06:02] RECOVERY - puppet last run on cloudvirt1003 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [20:06:04] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:06:14] (03CR) 10RobH: [C: 03+2] decom of labtestweb2001 [puppet] - 10https://gerrit.wikimedia.org/r/505895 (https://phabricator.wikimedia.org/T218024) (owner: 10RobH) [20:06:24] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:06:28] RECOVERY - puppet last run on cloudnet2002-dev is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [20:06:34] RECOVERY - puppet last run on cloudvirt1019 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [20:06:36] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:07:04] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:07:16] RECOVERY - puppet last run on labpuppetmaster1001 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [20:07:32] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:07:32] 10Operations, 10ops-codfw, 10decommission, 10cloud-services-team (Kanban): decommmision: labtestweb2001.wikimedia.org - https://phabricator.wikimedia.org/T218024 (10RobH) a:05RobH→03Papaul [20:07:36] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:07:44] RECOVERY - puppet last run on cloudvirtan1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:09:16] RECOVERY - puppet last run on cloudnet1004 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [20:09:46] PROBLEM - tools project instance distribution on cloudcontrol1003 is CRITICAL: CRITICAL: sgebastion class instances not spread out enough https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:10:12] RECOVERY - puppet last run on cloudvirt1008 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:11:00] RECOVERY - puppet last run on cloudcontrol1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:11:08] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [20:11:46] RECOVERY - puppet last run on labnet1001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [20:11:54] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [20:12:06] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [20:12:14] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [20:12:20] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [20:12:25] (03PS1) 10CDanis: Revert "wikipedia.org CNAME experiment: 4H CNAMEs" [dns] - 10https://gerrit.wikimedia.org/r/505896 (https://phabricator.wikimedia.org/T208263) [20:12:26] RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:12:34] RECOVERY - puppet last run on cloudnet1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:13:14] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [20:13:32] RECOVERY - puppet last run on cloudvirt1020 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [20:14:34] RECOVERY - puppet last run on cloudvirt1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:17:03] (03PS2) 10CDanis: Revert "wikipedia.org CNAME experiment: 4H CNAMEs" [dns] - 10https://gerrit.wikimedia.org/r/505896 (https://phabricator.wikimedia.org/T208263) [20:18:44] (03CR) 10CDanis: [C: 03+2] Revert "wikipedia.org CNAME experiment: 4H CNAMEs" [dns] - 10https://gerrit.wikimedia.org/r/505896 (https://phabricator.wikimedia.org/T208263) (owner: 10CDanis) [20:23:06] seeing 500 errors on meta.wikimedia.org for requests to load.php [20:23:28] 500 or 503, kostajh? 
[20:23:35] 500 [20:23:46] cdanis: on https://meta.wikimedia.org/wiki/Schema:HomepageVisit [20:23:51] WFM [20:24:09] on all pages actually. hrm [20:26:04] !log dropping disused restbase keyspaces -- T221530 [20:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:09] T221530: Drop old mobile-sections keyspaces - https://phabricator.wikimedia.org/T221530 [20:26:33] cdanis: the error is " require_once(/srv/mediawiki/docroot/wikimedia.org/w/../multiversion/MWMultiVersion.php): File not found " [20:28:12] (03PS1) 10CDanis: Revert "wikipedia.org: test with zone-local CNAME->DYNA" [dns] - 10https://gerrit.wikimedia.org/r/505901 (https://phabricator.wikimedia.org/T208263) [20:28:33] (03CR) 10Ottomata: "Yeah bastion sounds like a special case, or perhaps should be in its own class." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/505793 (owner: 10Alex Monk) [20:30:53] (03PS11) 10CRusnov: coherence report: General improvements and rack checks [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) [20:31:36] (03CR) 10CRusnov: coherence report: General improvements and rack checks (032 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) (owner: 10CRusnov) [20:32:47] (03CR) 10CDanis: [C: 03+2] Revert "wikipedia.org: test with zone-local CNAME->DYNA" [dns] - 10https://gerrit.wikimedia.org/r/505901 (https://phabricator.wikimedia.org/T208263) (owner: 10CDanis) [20:34:15] Reedy / cdanis I've filed T221702 as UBN since I consistently get this now, but please downgrade its urgency if your prefer [20:34:16] T221702: 500 error on load.php requests to meta.wikimedia.org - https://phabricator.wikimedia.org/T221702 [20:34:22] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Refactor public-facing DYNA scheme for primary project hostnames in our DNS - https://phabricator.wikimedia.org/T208263 
(10CDanis) `authdns-update` complete as of ~20:33:56 UTC. [20:34:46] kostajh: Can you see which host(s) it's trying to go through? [20:35:17] Krinkle: ^ docroot fail... [20:35:39] ugh [20:35:46] x-wikimedia-debug extension. very sorry! [20:36:02] Shouldn't be broken though [20:36:20] mwdebug1001 is what I was using [20:36:24] Which mwdebug host? as people were complaining about missing CSS when doing the SDC testing [20:36:26] That matchs [20:36:28] * Reedy looks [20:36:41] kostajh: seems to be specific to xwd+mwdebug1001+hhvm [20:37:05] I recall jo.e removing something relating to docroot for php.ini [20:38:23] the other three hosts and php7 don't exhibit it [20:42:38] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@51b4728]: Deploy new Updater fix for cnstraints (T221407) [20:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:43] T221407: Unrecognized subject messages in Updater - https://phabricator.wikimedia.org/T221407 [20:43:30] !log updating designate pools on cloudservices1003 and 1004 using eqiad1_pool_config.yml template from the puppet repo [20:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:23] PROBLEM - High lag on wdqs1003 is CRITICAL: 5454 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [20:47:45] (03PS2) 10Muehlenhoff: Include grub::defaults unconditionally [puppet] - 10https://gerrit.wikimedia.org/r/505886 (https://phabricator.wikimedia.org/T140100) [20:54:36] (03PS1) 10Nray: Remove wikibase sameAs A/B test config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505999 (https://phabricator.wikimedia.org/T209377) [20:55:41] !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@51b4728]: Deploy new Updater fix for cnstraints (T221407) (duration: 13m 03s) [20:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:46] T221407: Unrecognized subject messages in Updater - 
https://phabricator.wikimedia.org/T221407 [20:58:15] 10Operations, 10Traffic, 10cloud-services-team (Kanban): Update RIPE about changes in WMCS auth servers - https://phabricator.wikimedia.org/T221531 (10ayounsi) 05Open→03Resolved a:03ayounsi Updated. `lang=diff @@ -7,7 +7,7 @@ zone-c: WMF-RIPE -nserver: labs-ns0.wikimedia.org -nserver:... [20:58:19] (03PS2) 10Nray: Remove wikibase sameAs A/B test config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505999 (https://phabricator.wikimedia.org/T209377) [21:01:53] 10Operations, 10Cloud-VPS, 10DNS, 10Traffic, 10cloud-services-team (Kanban): Inconsistent lists of labs-ns* nameservers - https://phabricator.wikimedia.org/T205344 (10Krenair) 05Open→03Resolved a:03Andrew With the shutting down of labs-ns* looming, to make {T221531} possible @andrew made a change w... [21:02:15] (03PS1) 10CRusnov: Minor improvements to PuppetDB report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506001 (https://phabricator.wikimedia.org/T220422) [21:06:25] (03PS1) 10Dzahn: admins: add shell account for Willy Pao and add to datacenter-ops [puppet] - 10https://gerrit.wikimedia.org/r/506003 (https://phabricator.wikimedia.org/T221142) [21:07:48] (03CR) 10Niedzielski: Remove wikibase sameAs A/B test config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505999 (https://phabricator.wikimedia.org/T209377) (owner: 10Nray) [21:09:53] 10Operations, 10ops-eqiad, 10netops: Replace eqiad mgmt switches with EX4200s - https://phabricator.wikimedia.org/T213128 (10ayounsi) Filed T221675 for the aggregation switches. I agree, it doesn't make sens to re-purpose such old gear into "production". I guess we're down to: 1/ Buy new EX2300, more expe... [21:14:41] hmm gerrit's having another thread overload [21:14:42] https://gerrit.wikimedia.org/r/monitoring?part=graph&graph=activeThreads [21:15:03] which someone just reported they coulden't reviewers. 
[21:16:36] (03PS3) 10Nray: Remove wikibase sameAs A/B test config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505999 (https://phabricator.wikimedia.org/T209377) [21:17:41] (03CR) 10Nray: ">" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505999 (https://phabricator.wikimedia.org/T209377) (owner: 10Nray) [21:18:16] I've confirmed that the problem is indeed happening. [21:18:22] as i cannot load a group page [21:20:02] 10Operations, 10DC-Ops, 10Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10Dzahn) [21:20:08] damn. [21:20:28] 10Operations, 10DC-Ops, 10SRE-Access-Requests, 10Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10Dzahn) [21:21:48] thcipriani can we lower the http timeout? [21:22:14] yes, if there is an SRE around who can merge that patch [21:22:20] ^ mutante [21:22:37] i can if needbe [21:22:43] paladox: patchset link? [21:23:07] chaomodus: could we merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/504912/ + paladox's http timeout merged? 
[21:23:31] that one looks good [21:23:36] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/505847/ [21:23:40] ^ that one [21:23:49] okay will merge them [21:23:59] (03PS2) 10CRusnov: gerrit: increase projects cache [puppet] - 10https://gerrit.wikimedia.org/r/504912 (https://phabricator.wikimedia.org/T221026) (owner: 10Thcipriani) [21:24:08] chaomodus: thank you :) [21:24:22] ^^ [21:24:44] please hold for rebase ahah [21:25:25] (03CR) 10CRusnov: [C: 03+2] gerrit: increase projects cache [puppet] - 10https://gerrit.wikimedia.org/r/504912 (https://phabricator.wikimedia.org/T221026) (owner: 10Thcipriani) [21:25:43] (03PS3) 10CRusnov: Gerrit: Lower httpd.maxWait to 2mins [puppet] - 10https://gerrit.wikimedia.org/r/505847 (owner: 10Paladox) [21:28:31] (03CR) 10CRusnov: [C: 03+2] Gerrit: Lower httpd.maxWait to 2mins [puppet] - 10https://gerrit.wikimedia.org/r/505847 (owner: 10Paladox) [21:29:31] kay merges done and puppet-merged [21:29:47] well puppet merge will be done in a sec i see it's doing all the subsidiary puppets [21:31:05] thanks chaomodus! [21:31:09] chaomodus: k, lemme know when I can puppet run on cobalt, then I'll kick gerrit to pickup changes [21:31:23] I think you're good to go thcipriani [21:31:53] chaomodus: thanks [21:33:29] !log restarting gerrit to pickup config changes [21:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:13] PROBLEM - puppet last run on kafka2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [21:38:43] PROBLEM - puppet last run on db1124 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [21:39:37] PROBLEM - puppet last run on cumin1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks] [21:39:49] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 4 minutes ago with 8 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikibase/wikiba.se-deploy] [21:42:41] (03PS1) 10CRusnov: Cleanups to the oldhardware report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506008 [21:42:47] 10Operations, 10ops-codfw, 10decommission: Decommission db2033 - https://phabricator.wikimedia.org/T220070 (10RobH) [21:43:10] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [21:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:16] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [21:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:22] 10Operations, 10ops-codfw, 10decommission: Decommission db2033 - https://phabricator.wikimedia.org/T220070 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `db2033.codfw.wmnet` - db2033.codfw.wmnet - Removed from Puppet master and PuppetDB - Downtimed host on... [21:45:03] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [21:45:32] transient puppet failures fwiw [21:45:44] (03CR) 10jerkins-bot: [V: 04-1] Cleanups to the oldhardware report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506008 (owner: 10CRusnov) [21:46:38] (03CR) 10Dzahn: [C: 03+1] "we already use 'strong' for: librenms, tendril, netbox, debmonitor, icinga.. 
though gerrit has more external clients than those of course" [puppet] - 10https://gerrit.wikimedia.org/r/505410 (https://phabricator.wikimedia.org/T221499) (owner: 10Alex Monk) [21:47:13] (03PS2) 10CRusnov: Cleanups to the oldhardware report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506008 (https://phabricator.wikimedia.org/T220422) [21:48:01] (03CR) 10jerkins-bot: [V: 04-1] Cleanups to the oldhardware report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506008 (https://phabricator.wikimedia.org/T220422) (owner: 10CRusnov) [21:48:59] jouncebot: now [21:48:59] No deployments scheduled for the next 1 hour(s) and 11 minute(s) [21:49:01] jouncebot: next [21:49:01] In 1 hour(s) and 10 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190423T2300) [21:53:59] !log reedy@deploy1001 Synchronized php-1.34.0-wmf.1/extensions/ORES/includes/Specials/SpecialORESModels.php: T221696 (duration: 00m 55s) [21:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:04] T221696: ORESModels has a fatal error - https://phabricator.wikimedia.org/T221696 [21:55:47] (03PS1) 10RobH: db2033 decom [puppet] - 10https://gerrit.wikimedia.org/r/506012 (https://phabricator.wikimedia.org/T220070) [21:57:07] (03CR) 10Dzahn: "> For whatever reason it isn't though; I checked before leaving the comment." 
[puppet] - 10https://gerrit.wikimedia.org/r/437558 (https://phabricator.wikimedia.org/T196019) (owner: 10Dzahn) [22:00:13] (03CR) 10RobH: [C: 03+2] db2033 decom [puppet] - 10https://gerrit.wikimedia.org/r/506012 (https://phabricator.wikimedia.org/T220070) (owner: 10RobH) [22:00:46] (03Abandoned) 10Dzahn: dumps: switch phab1001->phab1003 as phab dumps source [puppet] - 10https://gerrit.wikimedia.org/r/437558 (https://phabricator.wikimedia.org/T196019) (owner: 10Dzahn) [22:01:02] (03PS4) 10Dzahn: switch phabricator from phab1001 to phab1003 [puppet] - 10https://gerrit.wikimedia.org/r/437620 (https://phabricator.wikimedia.org/T196019) [22:01:36] (03PS1) 10RobH: decom 2033 prod dns [dns] - 10https://gerrit.wikimedia.org/r/506014 (https://phabricator.wikimedia.org/T220070) [22:02:32] (03CR) 10RobH: [C: 03+2] decom 2033 prod dns [dns] - 10https://gerrit.wikimedia.org/r/506014 (https://phabricator.wikimedia.org/T220070) (owner: 10RobH) [22:05:14] RECOVERY - puppet last run on cumin1001 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [22:05:16] 10Operations, 10ops-codfw, 10decommission: Decommission db2033 - https://phabricator.wikimedia.org/T220070 (10RobH) [22:05:25] (03CR) 10MSantos: [C: 03+1] maps: align tilerator CPU usage across all nodes [puppet] - 10https://gerrit.wikimedia.org/r/505819 (owner: 10Gehel) [22:05:49] 10Operations, 10ops-codfw, 10decommission: Decommission db2033 - https://phabricator.wikimedia.org/T220070 (10RobH) a:05RobH→03Papaul [22:06:07] (03PS4) 10Bstorm: puppetdb: adapt the module so it works on Cloud VPS simply again [puppet] - 10https://gerrit.wikimedia.org/r/504968 (https://phabricator.wikimedia.org/T221721) [22:06:29] (03PS1) 10CRusnov: Minor improvements to management console report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506017 [22:07:00] (03CR) 10jerkins-bot: [V: 04-1] Minor improvements to management console report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506017 
(owner: CRusnov)
[22:07:21] (PS5) Dzahn: switch phabricator from phab1001 to phab1003 [puppet] - https://gerrit.wikimedia.org/r/437620 (https://phabricator.wikimedia.org/T196019)
[22:07:50] (PS5) Bstorm: puppetdb: adapt the module so it works on Cloud VPS simply again [puppet] - https://gerrit.wikimedia.org/r/504968 (https://phabricator.wikimedia.org/T221721)
[22:08:06] (PS1) CRusnov: Minor improvements to management console report [software/netbox-reports] - https://gerrit.wikimedia.org/r/506018 (https://phabricator.wikimedia.org/T220422)
[22:08:32] uh
[22:08:38] threads getting higher again :(
[22:08:47] oh
[22:08:48] nvm
[22:08:51] grrr, wrong graph
[22:08:56] was reading cpu
[22:09:10] RECOVERY - puppet last run on kafka2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[22:09:23] (PS1) BryanDavis: wmcs: Move service DNS from pdns to Designate [puppet] - https://gerrit.wikimedia.org/r/506019 (https://phabricator.wikimedia.org/T216451)
[22:09:34] RECOVERY - puppet last run on db1124 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[22:09:51] (CR) jerkins-bot: [V: -1] wmcs: Move service DNS from pdns to Designate [puppet] - https://gerrit.wikimedia.org/r/506019 (https://phabricator.wikimedia.org/T216451) (owner: BryanDavis)
[22:10:23] (PS3) CRusnov: Cleanups to the oldhardware report [software/netbox-reports] - https://gerrit.wikimedia.org/r/506008 (https://phabricator.wikimedia.org/T220422)
[22:10:29] (CR) Bstorm: [C: +2] puppetdb: adapt the module so it works on Cloud VPS simply again [puppet] - https://gerrit.wikimedia.org/r/504968 (https://phabricator.wikimedia.org/T221721) (owner: Bstorm)
[22:11:33] robh: ok to merge your decom thing?
[22:11:44] on puppetmaster?
[22:11:47] sorry, yes
[22:12:18] Done
[22:12:23] (PS2) CRusnov: Minor improvements to PuppetDB report [software/netbox-reports] - https://gerrit.wikimedia.org/r/506001 (https://phabricator.wikimedia.org/T220422)
[22:12:34] (CR) Alex Monk: [C: -1] "should probably rename the script, error inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/506019 (https://phabricator.wikimedia.org/T216451) (owner: BryanDavis)
[22:13:47] (PS2) BryanDavis: wmcs: Move service DNS from pdns to Designate [puppet] - https://gerrit.wikimedia.org/r/506019 (https://phabricator.wikimedia.org/T216451)
[22:13:57] (Abandoned) CRusnov: Minor improvements to management console report [software/netbox-reports] - https://gerrit.wikimedia.org/r/506018 (https://phabricator.wikimedia.org/T220422) (owner: CRusnov)
[22:14:47] (PS2) CRusnov: Minor improvements to management console report [software/netbox-reports] - https://gerrit.wikimedia.org/r/506017 (https://phabricator.wikimedia.org/T220422)
[22:16:55] (CR) BryanDavis: "> should probably rename the script" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/506019 (https://phabricator.wikimedia.org/T216451) (owner: BryanDavis)
[22:19:08] (PS4) CRusnov: Cleanups to the oldhardware report [software/netbox-reports] - https://gerrit.wikimedia.org/r/506008 (https://phabricator.wikimedia.org/T220422)
[22:21:06] (PS3) BryanDavis: wmcs: Move service DNS from pdns to Designate [puppet] - https://gerrit.wikimedia.org/r/506019 (https://phabricator.wikimedia.org/T216451)
[22:22:16] (PS4) BryanDavis: wmcs: Move service DNS from pdns to Designate [puppet] - https://gerrit.wikimedia.org/r/506019 (https://phabricator.wikimedia.org/T216451)
[22:24:40] Operations, SRE-Access-Requests: offboard tilman bayer - https://phabricator.wikimedia.org/T220565 (Dzahn) There is still an open subtask at T220575
[22:33:52] !log push firewall rule to pfw3-codfw - T221475
[22:33:57] Logged the message
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:33:57] T221475: Network setup for frmon2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T221475
[22:35:49] !log push firewall rule to pfw3-eqiad - T221475
[22:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:43:58] Operations, decommission, hardware-requests, Patch-For-Review: reimage WMF6937/mw1298 - https://phabricator.wikimedia.org/T215332 (Dzahn) @jijiki @robh Is this happening and i should skip mw1298 on T192457? Or is it going to be another host or none?
[22:53:18] (PS1) Zoranzoki21: Add "autoreviewer" to $wgRestrictionLevels on ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/506024 (https://phabricator.wikimedia.org/T221521)
[22:54:58] (PS2) Zoranzoki21: Add "autoreviewer" to $wgRestrictionLevels on ptwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/506024 (https://phabricator.wikimedia.org/T221521)
[22:55:11] I'll SWAT.
[23:00:04] hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190423T2300).
[23:00:04] rxy, kostajh, and James_F: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:13] rxy: You here?
[23:00:18] hello
[23:00:22] jouncebot forgot on me
[23:00:29] Hey kostajh; I assume you want your two together?
[23:00:42] James_F: up to you, I could verify them separately
[23:00:47] I have one patch too and I added it before few minutes
[23:00:51] But both together is fine
[23:00:55] But I no know why jouncebot forgot on me
[23:01:00] kostajh: Right.
[23:01:12] Zoranzoki21: It caches looks for a while.
[23:01:20] Ok is
[23:02:42] Reedy: Are you online?
[23:02:44] (CR) Jforrester: [C: +2] Add "autoreviewer" to $wgRestrictionLevels on ptwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/506024 (https://phabricator.wikimedia.org/T221521) (owner: Zoranzoki21)
[23:03:08] James_F: You are so fast :)=
[23:03:13] * James_F grins.
[23:03:19] (Merged) jenkins-bot: Add "autoreviewer" to $wgRestrictionLevels on ptwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/506024 (https://phabricator.wikimedia.org/T221521) (owner: Zoranzoki21)
[23:04:00] OK, Zoranzoki21, you able to test on mwdebug1002?
[23:04:10] James_F: Yes
[23:04:21] It's live there. Thanks!
[23:06:08] Testing...
[23:07:30] Ok is
[23:07:50] (PS3) Dzahn: site/mw/conftool: assign spare mw1297 as API server [puppet] - https://gerrit.wikimedia.org/r/504791 (https://phabricator.wikimedia.org/T192457)
[23:07:53] Zoranzoki21: Good to deploy?
[23:08:06] James_F: Yes, I mean irt
[23:08:07] (CR) jenkins-bot: Add "autoreviewer" to $wgRestrictionLevels on ptwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/506024 (https://phabricator.wikimedia.org/T221521) (owner: Zoranzoki21)
[23:08:07] *it
[23:08:31] OK, syncing now.
[23:09:05] * James_F twiddles thumbs, waiting for zuul/jenkins.
[23:09:21] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT T221521 Add autoreviewer to wgRestrictionLevels on ptwikinews (duration: 00m 54s)
[23:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:09:26] T221521: Autoreview protect on ptwikinews - https://phabricator.wikimedia.org/T221521
[23:09:26] Zoranzoki21: Done.
[23:09:47] Thanks, looks good
[23:13:23] James_F: As you have access for +2 at mediawiki/core can you merge https://gerrit.wikimedia.org/r/505429 and https://gerrit.wikimedia.org/r/505431 if it is ok?
[23:14:41] Zoranzoki21: I'll review later. Thanks!
[23:15:41] James_F: Ok, thanks
[23:16:38] oh dear, jenkins
[23:17:40] kostajh: Inorite?
[23:20:18] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.1/extensions/VisualEditor/modules/ve-mw/init/ve.init.mw.Target.js: SWAT T221668 VisualEditor: Restore external paste sanitization of DOM elements (duration: 00m 55s)
[23:20:19] Operations, ops-eqiad, DC-Ops, decommission: decommission phab1002/WMF4727 - https://phabricator.wikimedia.org/T221391 (Dzahn)
[23:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:20:23] T221668: Pasting tables from Google sheets pollutes the document with data-sheets- attributes - https://phabricator.wikimedia.org/T221668
[23:21:24] Operations, monitoring, Patch-For-Review: prometheus1004 /srv/prometheus/ops almost full - https://phabricator.wikimedia.org/T220326 (Dzahn) Open→Resolved a:Dzahn
[23:23:18] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.1/languages/Language.php: SWAT T219728 Add support for new Japanese era name 'Reiwa' (duration: 00m 52s)
[23:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:23:23] T219728: Support for new Japanese era name "Reiwa" - https://phabricator.wikimedia.org/T219728
[23:23:49] Okie-dokie.
[23:25:49] !log generating mcrouter certs for appservers, added mw1297.eqiad.wmnet (T192457)
[23:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:25:55] T192457: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457
[23:28:04] kostajh: Both patches (finally!) live on mwdebug1002.
[23:28:36] James_F: G'morning
[23:29:10] (I'm wokeup now)
[23:29:11] rxy: Hey. I've deployed it to wmf.1; it looked fine in my testing, but if you could have a second check that would be great.
[23:29:24] James_F: ok, checking
[23:29:50] https://ja.wikipedia.org/wiki/%E5%88%A9%E7%94%A8%E8%80%85:Rxy/%E3%82%B5%E3%83%B3%E3%83%89%E3%83%9C%E3%83%83%E3%82%AF%E3%82%B9 <- It seems good.
Thanks !
[23:30:09] rxy: Happy to help. Thanks for getting us compliant with the new era!
[23:31:08] * rxy go to bed again
[23:31:12] mwdebug1002 is taking ages to load :\
[23:31:28] "Error: 500, Internal Server Error"
[23:31:35] trying again...
[23:31:39] there we go
[23:31:53] kostajh: Yeah, it's not the fastest VM and the new code means it drops the PHP cache.
[23:35:18] James_F: looks OK. Having trouble verifying https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/506006 but I think it should be good to go live
[23:35:20] RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 1144 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[23:35:25] James_F: and the other one is fine too
[23:35:58] kostajh: Excellent, will sync.
[23:38:03] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.1/extensions/GrowthExperiments/includes/EventLogging/SpecialHomepageLogger.php: SWAT GrowthExperiments: Fix EventLogging errors (duration: 00m 53s)
[23:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:39:03] (PS4) Dzahn: site/mw/conftool: assign spare mw1297 as API server [puppet] - https://gerrit.wikimedia.org/r/504791 (https://phabricator.wikimedia.org/T192457)
[23:39:09] thanks James_F! have a nice evening
[23:39:36] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.1/extensions/GrowthExperiments/modules/homepage/ext.growthExperiments.Homepage.Logger.js: SWAT GrowthExperiments: Fix validation errors due to state='' (duration: 00m 53s)
[23:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:39:45] kostajh: And you!
[23:39:52] OK, SWAT done.
[23:39:55] as someone checked wit ops if theyre willing to make mwdebug* more powerful?
[23:39:57] with
[23:45:56] (CR) Dzahn: [C: +2] site/mw/conftool: assign spare mw1297 as API server [puppet] - https://gerrit.wikimedia.org/r/504791 (https://phabricator.wikimedia.org/T192457) (owner: Dzahn)
[23:48:16] Krenair: yea. https://phabricator.wikimedia.org/T203625 https://phabricator.wikimedia.org/T215368
[23:49:29] also https://phabricator.wikimedia.org/T212955
[23:49:44] maybe reopen the last one if it needs even more CPU
[23:50:06] per ""if we need more resources feel free to reopen."
[23:51:09] !log mw1297 - initial puppet run - will show up in Icinga in a little while but not pooled yet.. all the things are being installed right now
[23:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log