[00:02:03] <wikibugs>	 (CR) Dzahn: [C: +2] netboot: add phab1003 to partman [puppet] - https://gerrit.wikimedia.org/r/505040 (https://phabricator.wikimedia.org/T221389) (owner: Dzahn)
[00:05:15] <icinga-wm>	 RECOVERY - Mediawiki Cirrussearch update lag - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[00:05:25] <icinga-wm>	 RECOVERY - Mediawiki Cirrussearch update lag - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[00:26:25] <wikibugs>	 (PS1) Dzahn: site: add phab1003 [puppet] - https://gerrit.wikimedia.org/r/505051 (https://phabricator.wikimedia.org/T221389)
[00:31:38] <wikibugs>	 (PS2) Dzahn: site: add phab1003 [puppet] - https://gerrit.wikimedia.org/r/505051 (https://phabricator.wikimedia.org/T221389)
[00:32:09] <wikibugs>	 (CR) jerkins-bot: [V: -1] site: add phab1003 [puppet] - https://gerrit.wikimedia.org/r/505051 (https://phabricator.wikimedia.org/T221389) (owner: Dzahn)
[00:35:33] <wikibugs>	 (PS3) Dzahn: site: add phab1003 [puppet] - https://gerrit.wikimedia.org/r/505051 (https://phabricator.wikimedia.org/T221389)
[00:35:57] <wikibugs>	 (CR) Alex Monk: "Thanks Filippo. Seeing as these files are already live, what's the process for getting this commit merged?" [software/swift-ring] - https://gerrit.wikimedia.org/r/503544 (owner: Alex Monk)
[00:36:02] <wikibugs>	 (CR) jerkins-bot: [V: -1] site: add phab1003 [puppet] - https://gerrit.wikimedia.org/r/505051 (https://phabricator.wikimedia.org/T221389) (owner: Dzahn)
[00:36:43] <wikibugs>	 (PS4) Dzahn: site: add phab1003 [puppet] - https://gerrit.wikimedia.org/r/505051 (https://phabricator.wikimedia.org/T221389)
[00:37:40] <wikibugs>	 (CR) Dzahn: [C: +2] site: add phab1003 [puppet] - https://gerrit.wikimedia.org/r/505051 (https://phabricator.wikimedia.org/T221389) (owner: Dzahn)
[00:39:33] <wikibugs>	 (CR) Alex Monk: [C: -2] "needs porting to new repo" [software/certcentral] - https://gerrit.wikimedia.org/r/457933 (https://phabricator.wikimedia.org/T207372) (owner: Alex Monk)
[00:41:37] <wikibugs>	 Operations, serviceops, Patch-For-Review: setup/install WMF7426 as phab1003.eqiad.wmnet - https://phabricator.wikimedia.org/T221389 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ['phab1003.eqiad.wmnet'] ` The log can be found in `/var/log/wmf...
[01:21:55] <wikibugs>	 (PS7) Bstorm: cloudstore: deploy maps/scratch cluster as nfs::secondary [puppet] - https://gerrit.wikimedia.org/r/502342 (https://phabricator.wikimedia.org/T209527)
[01:22:44] <wikibugs>	 (CR) jerkins-bot: [V: -1] cloudstore: deploy maps/scratch cluster as nfs::secondary [puppet] - https://gerrit.wikimedia.org/r/502342 (https://phabricator.wikimedia.org/T209527) (owner: Bstorm)
[01:27:35] <icinga-wm>	 PROBLEM - puppet last run on oresrdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:42:47] <wikibugs>	 Operations, serviceops, Patch-For-Review: setup/install WMF7426 as phab1003.eqiad.wmnet - https://phabricator.wikimedia.org/T221389 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['phab1003.eqiad.wmnet'] `  Of which those **FAILED**: ` ['phab1003.eqiad.wmnet'] `
[01:59:19] <icinga-wm>	 RECOVERY - puppet last run on oresrdb1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[02:04:19] <wikibugs>	 (CR) CDanis: [C: +2] "If they're already live, just merge." [software/swift-ring] - https://gerrit.wikimedia.org/r/503544 (owner: Alex Monk)
[02:04:23] <wikibugs>	 (CR) CDanis: [V: +2 C: +2] deployment-prep: Update with live files [software/swift-ring] - https://gerrit.wikimedia.org/r/503544 (owner: Alex Monk)
[02:06:54] <wikibugs>	 (CR) Alex Monk: "Thanks for merging. Merge access in this repository is restricted." [software/swift-ring] - https://gerrit.wikimedia.org/r/503544 (owner: Alex Monk)
[02:47:57] <wikibugs>	 Operations, DBA, Jade, TechCom-RFC, and 3 others: Introduce a new namespace for collaborative judgements about wiki entities - https://phabricator.wikimedia.org/T200297 (Harej) The secondary schema that was requested was addressed in T202596. I think at one point the schema was submitted to @Maro...
[03:22:09] <icinga-wm>	 PROBLEM - puppet last run on analytics1056 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:42:25] <icinga-wm>	 PROBLEM - puppet last run on analytics1076 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:45:07] <wikibugs>	 (PS1) Alex Monk: deployment-prep: Use new poolcounter instance [mediawiki-config] - https://gerrit.wikimedia.org/r/505059
[03:53:11] <icinga-wm>	 PROBLEM - puppet last run on maps1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:53:57] <icinga-wm>	 RECOVERY - puppet last run on analytics1056 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[03:54:21] <wikibugs>	 (PS1) Alex Monk: deployment-prep: Use new ms-fe host. [mediawiki-config] - https://gerrit.wikimedia.org/r/505060
[04:08:57] <icinga-wm>	 RECOVERY - puppet last run on analytics1076 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[04:10:59] <icinga-wm>	 PROBLEM - puppet last run on rdb1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:11:21] <icinga-wm>	 PROBLEM - puppet last run on dbstore1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:19:45] <icinga-wm>	 RECOVERY - puppet last run on maps1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[04:20:19] <icinga-wm>	 PROBLEM - puppet last run on mc1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:37:27] <icinga-wm>	 RECOVERY - puppet last run on rdb1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:43:07] <icinga-wm>	 RECOVERY - puppet last run on dbstore1001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[04:45:29] <icinga-wm>	 PROBLEM - puppet last run on elastic1044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:46:47] <icinga-wm>	 RECOVERY - puppet last run on mc1019 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[04:56:42] <wikibugs>	 Operations, Puppet, Cloud-Services, Patch-For-Review, cloud-services-team (Kanban): Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (Krenair) The number of puppet.git cherry-picks on cloudinfra-internal-puppetmaster is now 0, there's just the tw...
[05:16:02] <wikibugs>	 Operations, ops-eqiad, DBA, Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (Marostegui) I have increased the priority cause s4 master is having memory errors again and needs to be replaced as soon as we can
[05:16:55] <icinga-wm>	 PROBLEM - puppet last run on restbase1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:17:17] <icinga-wm>	 RECOVERY - puppet last run on elastic1044 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[05:26:41] <icinga-wm>	 PROBLEM - puppet last run on ms-be1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:30:31] <icinga-wm>	 PROBLEM - puppet last run on labmon1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:40:32] <wikibugs>	 (CR) Alex Monk: "I think actually what I want to do is avoid reusing the existing zone numbers, make two new zones, each with one of the new instances in. " [software/swift-ring] - https://gerrit.wikimedia.org/r/503714 (owner: Alex Monk)
[05:48:45] <icinga-wm>	 RECOVERY - puppet last run on restbase1008 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[05:57:01] <icinga-wm>	 RECOVERY - puppet last run on labmon1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:58:29] <icinga-wm>	 RECOVERY - puppet last run on ms-be1023 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:00:02] <wikibugs>	 (PS2) Alex Monk: deployment-prep: Add stretch storage hosts [software/swift-ring] - https://gerrit.wikimedia.org/r/503714
[06:03:35] <wikibugs>	 Operations, ops-codfw: Predictive disk failure on db2047 - https://phabricator.wikimedia.org/T149670 (Marostegui) Open→Resolved Thanks! We are tracking those at T208323 and as we have many - we are waiting for them to fully fail before replacing (as sometimes it takes months) so closing this agai...
[06:13:17] <wikibugs>	 (PS1) Elukey: role::druid::analytics::worker: set stricter timeouts and jvm settings [puppet] - https://gerrit.wikimedia.org/r/505062
[06:33:03] <wikibugs>	 (CR) Elukey: [C: +2] role::druid::analytics::worker: set stricter timeouts and jvm settings [puppet] - https://gerrit.wikimedia.org/r/505062 (owner: Elukey)
[06:39:26] <elukey>	 !log roll restart of druid daemons on druid100[1-3] to pick up new jvm settings
[06:39:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:21] <wikibugs>	 (PS1) Giuseppe Lavagetto: mcrouter_generate_certs: fix puppetdb query [puppet] - https://gerrit.wikimedia.org/r/505064
[07:02:52] <wikibugs>	 (CR) jerkins-bot: [V: -1] mcrouter_generate_certs: fix puppetdb query [puppet] - https://gerrit.wikimedia.org/r/505064 (owner: Giuseppe Lavagetto)
[07:04:34] <wikibugs>	 (PS2) Giuseppe Lavagetto: mcrouter_generate_certs: fix puppetdb query [puppet] - https://gerrit.wikimedia.org/r/505064
[07:05:00] <wikibugs>	 (CR) jerkins-bot: [V: -1] mcrouter_generate_certs: fix puppetdb query [puppet] - https://gerrit.wikimedia.org/r/505064 (owner: Giuseppe Lavagetto)
[07:23:46] <wikibugs>	 (PS3) DCausse: Add a new extension point SshExecuteCommandInterceptor [software/gerrit] (wmf/stable-2.15) - https://gerrit.wikimedia.org/r/502487
[07:24:20] <wikibugs>	 (PS3) Giuseppe Lavagetto: mcrouter_generate_certs: fix puppetdb query [puppet] - https://gerrit.wikimedia.org/r/505064
[07:25:29] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] mcrouter_generate_certs: fix puppetdb query [puppet] - https://gerrit.wikimedia.org/r/505064 (owner: Giuseppe Lavagetto)
[07:33:29] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 89 probes of 409 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[07:35:45] <icinga-wm>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 123 probes of 448 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[07:38:49] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 15 probes of 409 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[07:41:03] <icinga-wm>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 4 probes of 448 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[08:05:17] <wikibugs>	 Operations, DBA, Patch-For-Review: correctable memory errors db1068 (commons primary master database) - https://phabricator.wikimedia.org/T213664 (Marostegui) Thanks for letting us know! This master will be replaced once the hosts at {T211613} are racked and installed.
[08:07:13] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] conftool: remove 2.7 metadata, add 3.7 [software/conftool] - https://gerrit.wikimedia.org/r/504980 (owner: CDanis)
[08:07:17] <wikibugs>	 Operations, DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (Marostegui) db2047 has another disk failed: `       logicaldrive 1 (3.3 TB, RAID 1+0, OK)        physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Predictive Failure)       physicaldrive 1I:1...
[08:09:46] <wikibugs>	 (Merged) jenkins-bot: conftool: remove 2.7 metadata, add 3.7 [software/conftool] - https://gerrit.wikimedia.org/r/504980 (owner: CDanis)
[08:16:01] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] gerrit: update 'accountPattern' for LDAP account locking [puppet] - https://gerrit.wikimedia.org/r/504981 (https://phabricator.wikimedia.org/T218654) (owner: BryanDavis)
[08:16:08] <wikibugs>	 (PS2) Alexandros Kosiaris: gerrit: update 'accountPattern' for LDAP account locking [puppet] - https://gerrit.wikimedia.org/r/504981 (https://phabricator.wikimedia.org/T218654) (owner: BryanDavis)
[08:32:39] <wikibugs>	 (PS1) Alexandros Kosiaris: Support affinity in all charts [deployment-charts] - https://gerrit.wikimedia.org/r/505185
[08:33:48] <wikibugs>	 (PS1) Giuseppe Lavagetto: mcrouter_generate: add ability to add manifests for new servers [puppet] - https://gerrit.wikimedia.org/r/505187
[08:45:08] <akosiaris>	 !log restart gerrit to pick up https://gerrit.wikimedia.org/r/504981
[08:45:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:43] <icinga-wm>	 PROBLEM - puppet last run on notebook1003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas]
[09:06:34] <wikibugs>	 (PS1) Hashar: gerrit: enable AccountDeactivator [puppet] - https://gerrit.wikimedia.org/r/505218 (https://phabricator.wikimedia.org/T218654)
[09:08:03] <wikibugs>	 (CR) Hashar: "The logging config is a copy paste from the plugin_log sections." [puppet] - https://gerrit.wikimedia.org/r/505218 (https://phabricator.wikimedia.org/T218654) (owner: Hashar)
[09:15:18] <wikibugs>	 (PS2) Giuseppe Lavagetto: mcrouter_generate: add ability to add manifests for new servers [puppet] - https://gerrit.wikimedia.org/r/505187
[09:15:18] <wikibugs>	 (PS1) Giuseppe Lavagetto: mcrouter_generate: add an audit functionality [puppet] - https://gerrit.wikimedia.org/r/505219
[09:20:11] <icinga-wm>	 RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[09:22:40] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] add a .gitreview [software/conftool] - https://gerrit.wikimedia.org/r/504923 (owner: CDanis)
[09:25:12] <wikibugs>	 (Merged) jenkins-bot: add a .gitreview [software/conftool] - https://gerrit.wikimedia.org/r/504923 (owner: CDanis)
[09:25:28] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] mcrouter_generate: add ability to add manifests for new servers [puppet] - https://gerrit.wikimedia.org/r/505187 (owner: Giuseppe Lavagetto)
[09:28:24] <wikibugs>	 (PS1) Petar.petkovic: Use higher unmodified MT threshold for Indonesian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/505220 (https://phabricator.wikimedia.org/T221353)
[09:32:21] <wikibugs>	 (CR) Paladox: "@hashar should we also set https://gerrit-review.googlesource.com/Documentation/config-gerrit.html#auth.autoUpdateAccountActiveStatus ?" [puppet] - https://gerrit.wikimedia.org/r/505218 (https://phabricator.wikimedia.org/T218654) (owner: Hashar)
[09:32:50] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] mcrouter_generate: add an audit functionality [puppet] - https://gerrit.wikimedia.org/r/505219 (owner: Giuseppe Lavagetto)
[09:44:37] <wikibugs>	 (CR) Paladox: "Actually setting ^^ is required for the AccountDeactivator." [puppet] - https://gerrit.wikimedia.org/r/505218 (https://phabricator.wikimedia.org/T218654) (owner: Hashar)
[09:53:01] <wikibugs>	 (PS2) Hashar: gerrit: enable AccountDeactivator [puppet] - https://gerrit.wikimedia.org/r/505218 (https://phabricator.wikimedia.org/T218654)
[09:56:11] <wikibugs>	 (CR) Hashar: "Eeek indeed, I have missed Ariel comment on the task and indeed in Gerrit code:" [puppet] - https://gerrit.wikimedia.org/r/505218 (https://phabricator.wikimedia.org/T218654) (owner: Hashar)
[09:56:20] <hashar>	 paladox: thank you :]
[09:57:42] <paladox>	 You're welcome :)
[09:58:24] <wikibugs>	 (CR) Paladox: [C: +1] gerrit: enable AccountDeactivator [puppet] - https://gerrit.wikimedia.org/r/505218 (https://phabricator.wikimedia.org/T218654) (owner: Hashar)
[10:07:37] <xSavitar>	 hashar: Hey, wanna land this, https://gerrit.wikimedia.org/r/c/mediawiki/extensions/LiquidThreads/+/505221?
[10:07:44] <xSavitar>	 It's blocking the phan patch
[10:21:37] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 84 probes of 409 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[10:22:27] <icinga-wm>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 114 probes of 448 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[10:23:51] <_joe_>	 uhm
[10:25:16] <_joe_>	 eqsin seems unreachable from a good part of europe and the US
[10:27:15] <_joe_>	 and indeed, there was a small dent in the traffic to eqsin text
[10:27:39] <_joe_>	 not large enough to earn a depool IMHO
[10:27:46] <_joe_>	 akosiaris, elukey ^^
[10:29:09] <elukey>	 ack
[10:32:11] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 14 probes of 409 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[10:33:05] <icinga-wm>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 4 probes of 448 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[10:33:51] <wikibugs>	 (PS1) Giuseppe Lavagetto: profile::mediawiki::jobrunner: allow reaching the local endpoint [puppet] - https://gerrit.wikimedia.org/r/505222 (https://phabricator.wikimedia.org/T215339)
[10:37:31] <hashar>	 xSavitar: yes done! thank you :)
[10:38:47] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] "Adds the ferm rule in labs:" [puppet] - https://gerrit.wikimedia.org/r/505222 (https://phabricator.wikimedia.org/T215339) (owner: Giuseppe Lavagetto)
[11:19:03] <wikibugs>	 (CR) Faidon Liambotis: [C: -1] "Some minor comments inline :)" (6 comments) [software/netbox-reports] - https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) (owner: CRusnov)
[11:53:00] <wikibugs>	 Operations, ops-eqiad, Operations-Software-Development: rack/setup/install cumin1001.eqiad.wmnet (new cumin master) - https://phabricator.wikimedia.org/T201346 (faidon)
[12:01:16] <wikibugs>	 (PS1) Urbanecm: Remove uploader user group from fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/505228 (https://phabricator.wikimedia.org/T221441)
[12:01:57] <wikibugs>	 (PS2) Urbanecm: Remove uploader user group from fawiki and merge it with autoconfirmed [mediawiki-config] - https://gerrit.wikimedia.org/r/505228 (https://phabricator.wikimedia.org/T221441)
[12:04:38] <wikibugs>	 (CR) Urbanecm: "Run mwscript --wiki=fawiki 'uploader' after deployment." [mediawiki-config] - https://gerrit.wikimedia.org/r/505228 (https://phabricator.wikimedia.org/T221441) (owner: Urbanecm)
[12:17:34] <wikibugs>	 (PS5) Urbanecm: Prepare initial configuration for initiativeswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/504502 (https://phabricator.wikimedia.org/T167375)
[12:17:51] <wikibugs>	 (PS6) Urbanecm: Prepare initial configuration for initiativeswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/504502 (https://phabricator.wikimedia.org/T167375)
[12:53:14] <wikibugs>	 (PS1) Gilles: Enable Priority Hints and Element Timing on eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/505237 (https://phabricator.wikimedia.org/T216499)
[12:55:32] <wikibugs>	 (CR) Gilles: [C: +2] Enable Priority Hints and Element Timing on eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/505237 (https://phabricator.wikimedia.org/T216499) (owner: Gilles)
[12:56:34] <wikibugs>	 (Merged) jenkins-bot: Enable Priority Hints and Element Timing on eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/505237 (https://phabricator.wikimedia.org/T216499) (owner: Gilles)
[12:56:57] <wikibugs>	 (CR) jenkins-bot: Enable Priority Hints and Element Timing on eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/505237 (https://phabricator.wikimedia.org/T216499) (owner: Gilles)
[12:59:38] <logmsgbot>	 !log gilles@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T216499 T216598 Enable Priority Hints and Element Timing on eswiki (duration: 00m 56s)
[12:59:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:59:46] <stashbot>	 T216499: Priority Hints origin trial - https://phabricator.wikimedia.org/T216499
[12:59:46] <stashbot>	 T216598: Element Timing for Images origin trial - https://phabricator.wikimedia.org/T216598
[13:07:21] <wikibugs>	 (PS12) Ottomata: Puppetize schema.wikimedia.org and refactor eventschemas module [puppet] - https://gerrit.wikimedia.org/r/504787 (https://phabricator.wikimedia.org/T219552)
[13:11:36] <wikibugs>	 Operations, DNS, Mail, Traffic, Patch-For-Review: Phabricator SPF record contains internal addressing for phab[12]001 - https://phabricator.wikimedia.org/T221288 (GedHaywood) I am the "User in #2019041710004636" mentioned in the first comment and a very novice subscriber to Phabricator.  Our...
[13:22:44] <wikibugs>	 (CR) Thcipriani: [C: -1] "From the gerrit.config docs auth.autoUpdateAccountActiveStatus:" [puppet] - https://gerrit.wikimedia.org/r/505218 (https://phabricator.wikimedia.org/T218654) (owner: Hashar)
[13:28:57] <andrewbogott>	 jijiki or fsero, can you catch me up on the state of the VMs in hat-imagescaler?  I'm doing some routine repair of puppet on VMs.   bonny.hat-imagescalers.eqiad.wmflabs is unreachable; docker-registry-test.hat-imagescalers.eqiad.wmflabs and hat-deploy1.hat-imagescalers.eqiad.wmflabs have had broken puppet for ages.
[13:30:04] <jijiki>	 bonny can be killed 
[13:30:24] <jijiki>	 docker-registry-test you will have to wait for f.sero 
[13:30:32] <fsero>	 You can kill it too
[13:30:46] <fsero>	 andrewbogott: 
[13:30:56] <fsero>	 Deploy1 I'm unaware of it
[13:31:02] <jijiki>	 hat-deploy1 is something we are working with tyler, I propose we leave it as is
[13:31:27] <andrewbogott>	 ok, I'll delete those two
[13:31:36] <jijiki>	 thank you andrew
[13:31:56] <andrewbogott>	 Leaving puppet broken for long periods of time is really a pain for me — for instance right now I'm trying to decom the old nameserver but puppet is the way I tell VMs about the new nameserver.
[13:33:34] <danmichaelo>	 Hi, https://commons.wikimedia.org/wiki/Special:Log is timing out atm. (error: "entire web request took longer than 60 seconds and timed out"), is this related to a known issue?
[13:34:49] <andrewbogott>	 jbond42: can you assist me with puppet repair on jbond-buster.puppet.eqiad.wmflabs, jbond-jessie.puppet.eqiad.wmflabs, jbond-puppet-client.puppet.eqiad.wmflabs?  (And maybe jmm-buster.puppet.eqiad.wmflabs while we're at it)
[13:35:12] <andrewbogott>	 jijiki: if you just want to 'git stash' or temporarily unapply the broken classes or whatever and get in one clean puppet run that'll get you out of the woods for now.
[13:36:59] <jijiki>	 andrewbogott: if we just shut down the VM, would it work?
[13:37:12] <jijiki>	 the work we were doing is stalled for now
[13:37:17] <andrewbogott>	 if it's shut down then it definitely won't get puppet updates :)
[13:38:23] <wikibugs>	 (CR) Elukey: [C: +1] "Had a chat with Andrew on IRC and we discussed some stuff related to this new service. Even 'schema.wikimedia.org' was mentioned for the m" [puppet] - https://gerrit.wikimedia.org/r/504787 (https://phabricator.wikimedia.org/T219552) (owner: Ottomata)
[13:38:31] <wikibugs>	 Operations, serviceops, Core Platform Team Backlog (Later), Patch-For-Review, Services (next): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (Nuria)
[13:39:14] <wikibugs>	 (CR) BBlack: [C: +1] "I re-dug through the history on this and reconfirmed everything, and I agree this is the right thing to do here.  Might want to add an add" [dns] - https://gerrit.wikimedia.org/r/504936 (https://phabricator.wikimedia.org/T221288) (owner: Herron)
[13:39:30] <jijiki>	 andrewbogott: if it is blocker for you, shut it down
[13:39:38] <jijiki>	 I will deal with it when we need it again
[13:39:39] <wikibugs>	 (CR) Giuseppe Lavagetto: "I've created all the mcrouter certs for these hosts." [puppet] - https://gerrit.wikimedia.org/r/504794 (https://phabricator.wikimedia.org/T192457) (owner: Dzahn)
[13:40:54] <wikibugs>	 (PS13) Ottomata: Puppetize schema.wikimedia.org and refactor eventschemas module [puppet] - https://gerrit.wikimedia.org/r/504787 (https://phabricator.wikimedia.org/T219552)
[13:41:51] <wikibugs>	 (CR) jerkins-bot: [V: -1] Puppetize schema.wikimedia.org and refactor eventschemas module [puppet] - https://gerrit.wikimedia.org/r/504787 (https://phabricator.wikimedia.org/T219552) (owner: Ottomata)
[13:44:27] <wikibugs>	 (PS14) Ottomata: Puppetize event schema http service and refactor eventschemas module [puppet] - https://gerrit.wikimedia.org/r/504787 (https://phabricator.wikimedia.org/T219552)
[13:45:04] <andrewbogott>	 jijiki: I can see about removing the broken classes myself.  Shutting it down doesn't help since I need VMs to get puppet updates in order to keep working, getting security patches, etc.
[13:45:17] <wikibugs>	 (CR) jerkins-bot: [V: -1] Puppetize event schema http service and refactor eventschemas module [puppet] - https://gerrit.wikimedia.org/r/504787 (https://phabricator.wikimedia.org/T219552) (owner: Ottomata)
[13:45:46] <wikibugs>	 (PS15) Ottomata: Puppetize event schema http service and refactor eventschemas module [puppet] - https://gerrit.wikimedia.org/r/504787 (https://phabricator.wikimedia.org/T219552)
[13:46:37] <wikibugs>	 (CR) jerkins-bot: [V: -1] Puppetize event schema http service and refactor eventschemas module [puppet] - https://gerrit.wikimedia.org/r/504787 (https://phabricator.wikimedia.org/T219552) (owner: Ottomata)
[13:47:15] <wikibugs>	 (PS16) Ottomata: Puppetize event schema http service and refactor eventschemas module [puppet] - https://gerrit.wikimedia.org/r/504787 (https://phabricator.wikimedia.org/T219552)
[13:48:44] <wikibugs>	 (CR) Ottomata: [C: +2] "No op for /srv/event-schemas, schema1001 doesn't exist yet in PCC catalog:" [puppet] - https://gerrit.wikimedia.org/r/504787 (https://phabricator.wikimedia.org/T219552) (owner: Ottomata)
[13:48:59] <wikibugs>	 (PS17) Ottomata: Puppetize event schema http service and refactor eventschemas module [puppet] - https://gerrit.wikimedia.org/r/504787 (https://phabricator.wikimedia.org/T219552)
[13:49:07] <wikibugs>	 (CR) Ottomata: [V: +2 C: +2] Puppetize event schema http service and refactor eventschemas module [puppet] - https://gerrit.wikimedia.org/r/504787 (https://phabricator.wikimedia.org/T219552) (owner: Ottomata)
[13:49:29] <andrewbogott>	 godog: ok if I enable puppet on filippo-log-jessie01.logging.eqiad.wmflabs?  
[13:49:46] <andrewbogott>	 (There are a couple of other VMs in that project with failed puppet catalogs too, if you care to look)
[13:50:39] <wikibugs>	 (CR) Alexandros Kosiaris: "What kind of repercussions will this have for automated tooling? Things like git pull via puppet-merge, sync-git-upstream used in labs or " [puppet] - https://gerrit.wikimedia.org/r/504973 (https://phabricator.wikimedia.org/T182756) (owner: Hashar)
[13:53:10] <wikibugs>	 (PS1) Ottomata: Use Optional[String] for eventschema::service server_alias [puppet] - https://gerrit.wikimedia.org/r/505244 (https://phabricator.wikimedia.org/T219552)
[13:53:31] <icinga-wm>	 PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:53:34] <jijiki>	 andrewbogott: I will let you know in a bit
[13:53:41] <andrewbogott>	 thanks!
[13:53:58] <wikibugs>	 (CR) Ottomata: [C: +2] Use Optional[String] for eventschema::service server_alias [puppet] - https://gerrit.wikimedia.org/r/505244 (https://phabricator.wikimedia.org/T219552) (owner: Ottomata)
[13:55:15] <icinga-wm>	 PROBLEM - puppet last run on schema1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:55:19] <wikibugs>	 (CR) Paladox: [C: +1] "> What kind of repercussions will this have for automated tooling?" [puppet] - https://gerrit.wikimedia.org/r/504973 (https://phabricator.wikimedia.org/T182756) (owner: Hashar)
[13:56:15] <icinga-wm>	 PROBLEM - puppet last run on kafka1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:57:09] <wikibugs>	 (CR) Volans: "FYI This needs to be rebased and has conflicts. Couple of pending comments inline, but I didn't review the whole last PS. I'll do after th" (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: CRusnov)
[13:58:07] <icinga-wm>	 PROBLEM - puppet last run on kafka2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:58:27] <jijiki>	 ottomata: is that you ^ ?
[13:58:49] <ottomata>	 kafka2003!  was a no op in pcc
[13:58:50] <ottomata>	 probably me
[13:58:50] <ottomata>	 checking
[13:59:46] <jijiki>	 andrewbogott: tyler deleted the VM, so we are good
[13:59:54] <andrewbogott>	 that's easy :)  thank you!
[14:01:09] <wikibugs>	 Operations: Broken puppet in the 'logging' project - https://phabricator.wikimedia.org/T221450 (Andrew)
[14:01:13] <icinga-wm>	 PROBLEM - puppet last run on schema1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:01:27] <ottomata>	 ^ is me
[14:01:36] <wikibugs>	 (PS1) Ottomata: Fix source url for eventschemas service document_root [puppet] - https://gerrit.wikimedia.org/r/505245 (https://phabricator.wikimedia.org/T219552)
[14:01:38] <wikibugs>	 (PS1) Ottomata: Fix reference to subscribe eventschemas::mediawiki [puppet] - https://gerrit.wikimedia.org/r/505246 (https://phabricator.wikimedia.org/T219552)
[14:02:10] <wikibugs>	 (CR) jerkins-bot: [V: -1] Fix source url for eventschemas service document_root [puppet] - https://gerrit.wikimedia.org/r/505245 (https://phabricator.wikimedia.org/T219552) (owner: Ottomata)
[14:04:52] <wikibugs>	 (PS2) Ottomata: Fix source url for eventschemas service document_root [puppet] - https://gerrit.wikimedia.org/r/505245 (https://phabricator.wikimedia.org/T219552)
[14:05:20] <wikibugs>	 (CR) Thcipriani: "> What kind of repercussions will this have for automated tooling?" [puppet] - https://gerrit.wikimedia.org/r/504973 (https://phabricator.wikimedia.org/T182756) (owner: Hashar)
[14:05:40] <wikibugs>	 (CR) Ottomata: [C: +2] Fix source url for eventschemas service document_root [puppet] - https://gerrit.wikimedia.org/r/505245 (https://phabricator.wikimedia.org/T219552) (owner: Ottomata)
[14:05:51] <wikibugs>	 (PS2) Ottomata: Fix reference to subscribe eventschemas::mediawiki [puppet] - https://gerrit.wikimedia.org/r/505246 (https://phabricator.wikimedia.org/T219552)
[14:05:58] <wikibugs>	 (CR) Ottomata: [V: +2 C: +2] Fix reference to subscribe eventschemas::mediawiki [puppet] - https://gerrit.wikimedia.org/r/505246 (https://phabricator.wikimedia.org/T219552) (owner: Ottomata)
[14:06:03] <icinga-wm>	 PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:09:25] <icinga-wm>	 PROBLEM - puppet last run on kafka2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:09:25] <icinga-wm>	 RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[14:10:47] <icinga-wm>	 PROBLEM - Check systemd state on schema1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[14:10:56] <wikibugs>	 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10elukey) Had an interesting chat with Gilles today about his use case....
[14:11:19] <icinga-wm>	 PROBLEM - Check systemd state on schema2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[14:11:21] <icinga-wm>	 PROBLEM - puppet last run on schema2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_clone_mediawiki/eventschemas]
[14:12:25] <wikibugs>	 (03PS1) 10Ottomata: eventschemas service fixes [puppet] - 10https://gerrit.wikimedia.org/r/505247 (https://phabricator.wikimedia.org/T219552)
[14:13:31] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] eventschemas service fixes [puppet] - 10https://gerrit.wikimedia.org/r/505247 (https://phabricator.wikimedia.org/T219552) (owner: 10Ottomata)
[14:14:23] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] site/conftool: assign mw2150 jobrunner, mw2244,mw2245 API servers [puppet] - 10https://gerrit.wikimedia.org/r/504794 (https://phabricator.wikimedia.org/T192457) (owner: 10Dzahn)
[14:15:12] <wikibugs>	 (03PS7) 10Alexandros Kosiaris: Give roles to the new kubernetes[12]00[56] VMs [puppet] - 10https://gerrit.wikimedia.org/r/504851 (https://phabricator.wikimedia.org/T220822)
[14:15:59] <icinga-wm>	 RECOVERY - Check systemd state on schema1001 is OK: OK - running: The system is fully operational
[14:16:25] <icinga-wm>	 RECOVERY - puppet last run on schema1001 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[14:18:44] <wikibugs>	 (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.0.23 [software/spicerack] - 10https://gerrit.wikimedia.org/r/505248
[14:20:00] <wikibugs>	 (03PS1) 10BBlack: wikipedia.org CNAME experiment: 4H CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/505249 (https://phabricator.wikimedia.org/T208263)
[14:20:22] <wikibugs>	 (03CR) 10Alexandros Kosiaris: Give roles to the new kubernetes[12]00[56] VMs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/504851 (https://phabricator.wikimedia.org/T220822) (owner: 10Alexandros Kosiaris)
[14:20:26] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] Give roles to the new kubernetes[12]00[56] VMs [puppet] - 10https://gerrit.wikimedia.org/r/504851 (https://phabricator.wikimedia.org/T220822) (owner: 10Alexandros Kosiaris)
[14:20:43] <wikibugs>	 (03PS1) 10Ottomata: Add schema.svc IPs for eqiad and codfw [dns] - 10https://gerrit.wikimedia.org/r/505250 (https://phabricator.wikimedia.org/T219552)
[14:21:20] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Add schema.svc IPs for eqiad and codfw [dns] - 10https://gerrit.wikimedia.org/r/505250 (https://phabricator.wikimedia.org/T219552) (owner: 10Ottomata)
[14:22:21] <icinga-wm>	 RECOVERY - puppet last run on kafka1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:24:04] <wikibugs>	 (03PS2) 10Ottomata: Add schema.svc IPs for eqiad and codfw [dns] - 10https://gerrit.wikimedia.org/r/505250 (https://phabricator.wikimedia.org/T219552)
[14:24:05] <icinga-wm>	 RECOVERY - puppet last run on kafka2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:25:12] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Add schema.svc IPs for eqiad and codfw [dns] - 10https://gerrit.wikimedia.org/r/505250 (https://phabricator.wikimedia.org/T219552) (owner: 10Ottomata)
[14:28:09] <wikibugs>	 (03Restored) 10Chico Venancio: InitialiseSettings.php: add years to wgNamespacesWithSubpages for WikimaniaWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504798 (https://phabricator.wikimedia.org/T221297) (owner: 10Chico Venancio)
[14:28:11] <icinga-wm>	 PROBLEM - puppet last run on kubernetes1005 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. Failed resources (up to 3 shown): Service[docker],Package[kubernetes-node],Package[docker-engine],Package[docker-registry.discovery.wmnet/calico/node]
[14:29:03] <wikibugs>	 (03Abandoned) 10Chico Venancio: InitialiseSettings.php: add years to wgNamespacesWithSubpages for WikimaniaWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504798 (https://phabricator.wikimedia.org/T221297) (owner: 10Chico Venancio)
[14:29:27] <wikibugs>	 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review: Export useful metrics from haproxy logs for Thumbor - https://phabricator.wikimedia.org/T220499 (10jijiki) @Gilles This is fixed now, I will though revert back to nginx for the weekend. We do have data we can work with from today.
[14:30:21] <wikibugs>	 (03PS3) 10Hashar: gerrit: reduce sshd.MaxConnectionsPerUser 32 -> 4 [puppet] - 10https://gerrit.wikimedia.org/r/504973 (https://phabricator.wikimedia.org/T182756)
[14:30:27] <icinga-wm>	 PROBLEM - puppet last run on mw1305 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:30:52] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1001 with 10G interfaces - https://phabricator.wikimedia.org/T221141 (10Cmjohnson)
[14:31:07] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1001 with 10G interfaces - https://phabricator.wikimedia.org/T221141 (10Cmjohnson)
[14:31:19] <wikibugs>	 (03CR) 10Hashar: "Since jenkins-bot is in the Non Interactive group, it is already in the  limited batch queue. Then that is a queue so actions just pills u" [puppet] - 10https://gerrit.wikimedia.org/r/504973 (https://phabricator.wikimedia.org/T182756) (owner: 10Hashar)
[14:31:34] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1002 with 10G interfaces - https://phabricator.wikimedia.org/T221140 (10Cmjohnson)
[14:31:56] <wikibugs>	 10Operations, 10Traffic: Puppet broken on two VMs in the 'traffic' project - https://phabricator.wikimedia.org/T221454 (10Andrew)
[14:32:03] <icinga-wm>	 PROBLEM - puppet last run on kubernetes1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Service[docker]
[14:32:31] <wikibugs>	 10Operations, 10Epic, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1002 with 10G interfaces - https://phabricator.wikimedia.org/T221140 (10Cmjohnson) a:05Cmjohnson→03Andrew removing ops-eqiad tag and assigning to @Andrew
[14:33:17] <icinga-wm>	 RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:33:19] <wikibugs>	 10Operations, 10Epic, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1001 with 10G interfaces - https://phabricator.wikimedia.org/T221141 (10Cmjohnson) a:05Cmjohnson→03Andrew Removing the ops-eqiad tag and assigning to @Andrew
[14:37:17] <icinga-wm>	 RECOVERY - puppet last run on kubernetes1006 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[14:37:31] <icinga-wm>	 PROBLEM - puppet last run on kubernetes2005 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 5 minutes ago with 3 failures. Failed resources (up to 3 shown): Service[docker],Package[kubernetes-node],Package[docker-engine],Package[docker-registry.discovery.wmnet/calico/node]
[14:39:03] <wikibugs>	 (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.0.23 [software/spicerack] - 10https://gerrit.wikimedia.org/r/505248 (owner: 10Volans)
[14:40:21] <icinga-wm>	 RECOVERY - puppet last run on kafka2001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[14:43:06] <wikibugs>	 (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.23 [software/spicerack] - 10https://gerrit.wikimedia.org/r/505248 (owner: 10Volans)
[14:43:51] <icinga-wm>	 RECOVERY - puppet last run on kubernetes1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:44:11] <wikibugs>	 (03CR) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.23 [software/spicerack] - 10https://gerrit.wikimedia.org/r/505248 (owner: 10Volans)
[14:44:26] <framawiki>	 Special:Log on commons currently times out after 60s, is it known? https://commons.wikimedia.org/wiki/Special:Log
[14:45:33] <wikibugs>	 (03PS1) 10Volans: Upstream release v0.0.23 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/505252
[14:46:52] <wikibugs>	 (03PS1) 10Hashar: zuul: log stack dump to their own file [puppet] - 10https://gerrit.wikimedia.org/r/505253
[14:47:13] <wikibugs>	 (03CR) 10Hashar: "Untested.." [puppet] - 10https://gerrit.wikimedia.org/r/505253 (owner: 10Hashar)
[14:47:57] <wikibugs>	 (03PS1) 10Ottomata: LVS for schema.svc [puppet] - 10https://gerrit.wikimedia.org/r/505254
[14:48:03] <p858snake|L>	 framawiki: can I suggest you create a phabricator task please
[14:48:09] <wikibugs>	 10Operations, 10Wikimedia-production-error: Special:Log on commons -- entire web request took longer than 60 seconds and timed out - https://phabricator.wikimedia.org/T221458 (10CDanis)
[14:48:15] <wikibugs>	 10Operations, 10Wikimedia-production-error: Special:Log on commons -- entire web request took longer than 60 seconds and timed out - https://phabricator.wikimedia.org/T221458 (10CDanis) p:05Triage→03High
[14:48:18] <cdanis>	 thanks framawiki, I filed https://phabricator.wikimedia.org/T221458
[14:48:45] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] LVS for schema.svc [puppet] - 10https://gerrit.wikimedia.org/r/505254 (owner: 10Ottomata)
[14:48:56] <wikibugs>	 10Operations, 10Traffic, 10Wikimedia-production-error: Timeout after 30s on Special:Log on Commons - https://phabricator.wikimedia.org/T221459 (10Framawiki)
[14:49:16] <wikibugs>	 10Operations, 10Traffic, 10Wikimedia-production-error: Timeout after 30s on Special:Log on Commons - https://phabricator.wikimedia.org/T221459 (10Framawiki)
[14:49:18] <wikibugs>	 10Operations, 10Wikimedia-production-error: Special:Log on commons -- entire web request took longer than 60 seconds and timed out - https://phabricator.wikimedia.org/T221458 (10Framawiki)
[14:49:35] <wikibugs>	 (03PS2) 10Ottomata: LVS for schema.svc [puppet] - 10https://gerrit.wikimedia.org/r/505254 (https://phabricator.wikimedia.org/T219552)
[14:50:54] <wikibugs>	 (03CR) 10Volans: [C: 03+2] Upstream release v0.0.23 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/505252 (owner: 10Volans)
[14:51:24] <wikibugs>	 10Operations, 10Wikimedia-production-error: Special:Log on commons -- entire web request took longer than 60 seconds and timed out - https://phabricator.wikimedia.org/T221458 (10CDanis) Looks like database being slow?  Pretty sure this is a MW API call backing the pageload of Special:Log on commonswiki.  `#0 /...
[14:52:08] <framawiki>	 Thanks for the task cdanis 
[14:53:17] <cdanis>	 np
[14:53:18] <wikibugs>	 10Operations, 10Wikimedia-production-error: Special:Log on commons -- entire web request took longer than 60 seconds and timed out - https://phabricator.wikimedia.org/T221458 (10CDanis) Found a logstash fatal that definitely implicates database on a commonswiki Special:Log pageload https://logstash.wikimedia.o...
[14:54:31] <wikibugs>	 10Operations, 10DBA, 10MediaWiki-Database, 10Wikimedia-production-error: Special:Log on commons -- entire web request took longer than 60 seconds and timed out - https://phabricator.wikimedia.org/T221458 (10CDanis)
[14:54:58] <wikibugs>	 (03Merged) 10jenkins-bot: Upstream release v0.0.23 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/505252 (owner: 10Volans)
[14:55:04] <icinga-wm>	 PROBLEM - puppet last run on kubernetes2006 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 4 seconds ago with 3 failures. Failed resources (up to 3 shown): Service[docker],Package[kubernetes-node],Package[docker-engine],Package[docker-registry.discovery.wmnet/calico/node]
[14:56:17] <wikibugs>	 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10Nuria) >so the plan would be to install thumbor on stat1005  How would...
[14:56:18] <icinga-wm>	 RECOVERY - puppet last run on mw1305 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:58:24] <wikibugs>	 10Operations, 10DBA, 10MediaWiki-Database, 10Wikimedia-production-error: Special:Log on commons -- entire web request took longer than 60 seconds and timed out - https://phabricator.wikimedia.org/T221458 (10Framawiki) Is {T221380} related?
[14:59:17] <volans>	 !log uploaded spicerack_0.0.23-1_amd64.deb to apt.wikimedia.org stretch-wikimedia
[14:59:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:28] <wikibugs>	 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10elukey) I think the idea would be to run it with/without GPU active an...
[14:59:32] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/15910/" [puppet] - 10https://gerrit.wikimedia.org/r/505254 (https://phabricator.wikimedia.org/T219552) (owner: 10Ottomata)
[15:00:22] <wikibugs>	 (03PS3) 10Ottomata: LVS for schema.svc [puppet] - 10https://gerrit.wikimedia.org/r/505254 (https://phabricator.wikimedia.org/T219552)
[15:00:50] <wikibugs>	 10Operations, 10serviceops, 10vm-requests, 10Patch-For-Review, 10User-jijiki: Site: 4 VM request for kubernetes - https://phabricator.wikimedia.org/T220822 (10akosiaris) a:03ayounsi This is almost done. That only thing missing seems to be the peering with the juniper routers.   @Ayounsi, could you plea...
[15:02:01] <wikibugs>	 (03CR) 10Hashar: "To clarify. Zuul scheduler actually has two permanent connections:" [puppet] - 10https://gerrit.wikimedia.org/r/504973 (https://phabricator.wikimedia.org/T182756) (owner: 10Hashar)
[15:02:32] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] LVS for schema.svc [puppet] - 10https://gerrit.wikimedia.org/r/505254 (https://phabricator.wikimedia.org/T219552) (owner: 10Ottomata)
[15:02:40] <wikibugs>	 (03PS4) 10Ottomata: LVS for schema.svc [puppet] - 10https://gerrit.wikimedia.org/r/505254 (https://phabricator.wikimedia.org/T219552)
[15:02:44] <wikibugs>	 (03CR) 10Ottomata: [V: 03+2 C: 03+2] LVS for schema.svc [puppet] - 10https://gerrit.wikimedia.org/r/505254 (https://phabricator.wikimedia.org/T219552) (owner: 10Ottomata)
[15:03:12] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] "Merging this in the interest of keeping things homogeneous while working on the actual groups that have access to kubernetes tokens" [puppet] - 10https://gerrit.wikimedia.org/r/503167 (https://phabricator.wikimedia.org/T220785) (owner: 10KartikMistry)
[15:03:18] <icinga-wm>	 RECOVERY - puppet last run on kubernetes2005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:03:21] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Add santhosh to deploy-service [puppet] - 10https://gerrit.wikimedia.org/r/503167 (https://phabricator.wikimedia.org/T220785) (owner: 10KartikMistry)
[15:06:17] <wikibugs>	 (03CR) 10CRusnov: "nitpiiiiicks" (036 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) (owner: 10CRusnov)
[15:07:41] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting deployment access for santhosh - https://phabricator.wikimedia.org/T220785 (10akosiaris) 05Open→03Resolved Change merged, thanks!
[15:08:08] <wikibugs>	 (03PS9) 10CRusnov: coherence report: General improvements and rack checks [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422)
[15:08:39] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] coherence report: General improvements and rack checks [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) (owner: 10CRusnov)
[15:08:58] <wikibugs>	 (03PS1) 10Ottomata: Fix eventschemas name in services.yaml [puppet] - 10https://gerrit.wikimedia.org/r/505258 (https://phabricator.wikimedia.org/T219552)
[15:09:16] <icinga-wm>	 PROBLEM - puppet last run on lvs1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:09:38] <icinga-wm>	 PROBLEM - puppet last run on lvs4006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:09:44] <icinga-wm>	 PROBLEM - puppet last run on lvs5003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:09:44] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2006 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.43:8190]) https://wikitech.wikimedia.org/wiki/PyBal
[15:09:52] <icinga-wm>	 PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:10:09] <akosiaris>	 hmm
[15:10:22] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 46 connections established with conf1004.eqiad.wmnet:4001 (min=47) https://wikitech.wikimedia.org/wiki/PyBal
[15:10:40] <icinga-wm>	 RECOVERY - puppet last run on kubernetes2006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[15:11:01] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Fix eventschemas name in services.yaml [puppet] - 10https://gerrit.wikimedia.org/r/505258 (https://phabricator.wikimedia.org/T219552) (owner: 10Ottomata)
[15:11:02] <icinga-wm>	 PROBLEM - puppet last run on lvs5001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:11:13] <wikibugs>	 (03PS2) 10Ottomata: Fix eventschemas name in services.yaml and service.yaml [puppet] - 10https://gerrit.wikimedia.org/r/505258 (https://phabricator.wikimedia.org/T219552)
[15:11:14] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.43:8190]) https://wikitech.wikimedia.org/wiki/PyBal
[15:11:17] <bblack>	 why is puppet failing on LVSes?
[15:11:22] <icinga-wm>	 PROBLEM - puppet last run on lvs1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:11:22] <icinga-wm>	 PROBLEM - puppet last run on lvs2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:11:25] <akosiaris>	 I have no idea, investigating
[15:11:26] <ottomata>	 me ^
[15:11:29] <akosiaris>	 ah
[15:11:36] <bblack>	 heh, revert revert
[15:11:41] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Fix eventschemas name in services.yaml and service.yaml [puppet] - 10https://gerrit.wikimedia.org/r/505258 (https://phabricator.wikimedia.org/T219552) (owner: 10Ottomata)
[15:11:47] <bblack>	 or fix I guess
[15:12:15] <bblack>	 why are we putting a new service into LVS on a friday (and an EU holiday!)
[15:12:22] <ottomata>	 i ran pcc it was fine...
[15:12:44] <ottomata>	 bblack:  i can revert if you like, didn't think it would matter since it is new, static files only, and not external
[15:13:12] <icinga-wm>	 PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:13:12] <bblack>	 pcc just means that the puppet compiler is happy with your change to some degree, it doesn't really tell you much about any real impact on systems
[15:13:12] <wikibugs>	 (03Abandoned) 10Alexandros Kosiaris: kubelet operational latencies: increase thresholds by 10x [puppet] - 10https://gerrit.wikimedia.org/r/503079 (https://phabricator.wikimedia.org/T219556) (owner: 10CDanis)
[15:13:29] <ottomata>	 aye, but didn't expect the catalog to fail
[15:13:55] <wikibugs>	 (03PS10) 10CRusnov: coherence report: General improvements and rack checks [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422)
[15:14:32] <icinga-wm>	 RECOVERY - puppet last run on lvs1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[15:14:38] <bblack>	 yeah hopefully the duplicate eventbus name doesn't leave some imperfect states in places where the catalog didn't fail (lvs1016?)
[15:15:00] <icinga-wm>	 RECOVERY - puppet last run on lvs5003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[15:15:46] <bblack>	 Apr 19 15:05:44 lvs1016 lldpd[2179]: 2019-04-19T15:05:44 [INFO/netlink] removal request for address of 10.2.2.31%1, but no knowledge of it
[15:15:57] <bblack>	 all kinds of strangeness
[15:16:12] <wikibugs>	 (03PS1) 10Ottomata: Use profile::lvs::realserver::pools in eventschemas service.yaml [puppet] - 10https://gerrit.wikimedia.org/r/505261 (https://phabricator.wikimedia.org/T219552)
[15:16:13] <ottomata>	 yikesss.
[15:16:16] <akosiaris>	 ottomata: bblack has a point though. I'd also add that the change wasn't reviewed by anyone
[15:16:29] <akosiaris>	 and this is lvs...
[15:17:07] <ottomata>	 ya akosiaris elu key reviewed it quickly, but i think you guys are right, i read the 'Always have your changes reviewed' in the LVS docs after I merged.  i'm sorry about that
[15:17:10] <ottomata>	 shall I revert?
[15:17:16] <bblack>	 that change could potentially break literally every internal service we have :)
[15:17:30] <bblack>	 the revert might too for all I know, we're in an unknown state, give me some time to stare at things
[15:17:56] <ottomata>	 yikes, ok, did not expect this to affect other things, was just trying to add a new low priority thing.
[15:18:29] <icinga-wm>	 PROBLEM - Host schema.svc.codfw.wmnet is DOWN: PING CRITICAL - Packet loss = 100%
[15:18:31] <icinga-wm>	 PROBLEM - Host schema.svc.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100%
[15:18:43] <bblack>	 I get the "just a new service" thing, but deploying new LVS config is complex and fraught with peril (which isn't your fault!), and can break things for all the other critical services going through LVS
[15:18:44] <ottomata>	 hm, tried to do the alert puppet disable
[15:18:50] <herron>	 those paged out, need any help?
[15:19:04] <ottomata>	 no, sorry herron, this is my fault.
[15:19:08] <ottomata>	 i marked those as non critical
[15:19:10] <wikibugs>	 (03CR) 10Hashar: "The setting is for the auth backend which remains auth.type = LDAP ?   The AccountDeactivator has only been implemented for LDAP to verify" [puppet] - 10https://gerrit.wikimedia.org/r/505218 (https://phabricator.wikimedia.org/T218654) (owner: 10Hashar)
[15:19:14] <bblack>	 for those getting pages: you can ignore, nothing's yet critical as far as we know
[15:19:20] <herron>	 kk no worries I’m around lmk
[15:19:25] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Use profile::lvs::realserver::pools in eventschemas service.yaml [puppet] - 10https://gerrit.wikimedia.org/r/505261 (https://phabricator.wikimedia.org/T219552) (owner: 10Ottomata)
[15:20:10] <apergos>	 no pages for me (yet)
[15:20:41] <volans>	 ack
[15:21:09] <cdanis>	 am I being dumb or do I not see the puppet failures in the syslogs of these machines?
[15:21:41] <bblack>	 log-invisible puppet failures are a thing heh
[15:21:48] <bblack>	 there's puppetboard or something to look at them
[15:21:51] <cdanis>	 D:
[15:21:55] <icinga-wm>	 ACKNOWLEDGEMENT - LVS HTTP IPv4 on schema.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds ottomata new service https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[15:22:25] <chaomodus>	 oy
[15:22:44] <cdanis>	 https://puppetboard.wikimedia.org/node/lvs2001.codfw.wmnet
[15:22:49] <cdanis>	 the 'failed' run has nothing in it
[15:23:13] <_joe_>	 cdanis: what are you trying to understand?
[15:23:53] <wikibugs>	 (03PS2) 10Elukey: admin: add the analytics system user to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/504839 (https://phabricator.wikimedia.org/T220971)
[15:23:54] <bblack>	 he just wanted to see the agent output for the failed puppet run icinga complained about, I think
[15:23:57] <_joe_>	 AFAICS the last puppet run on lvs2001 was successful, 17 minutes ago
[15:24:03] <bblack>	 but I guess if catalog compilation fails there's not much to see
[15:24:15] <bblack>	 _joe_:
[15:24:18] <bblack>	 15:11 <+icinga-wm> PROBLEM - puppet last run on lvs2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:24:19] <cdanis>	 yeah that's right bblack
[15:24:28] <_joe_>	 bblack: no you're wrong
[15:24:33] <_joe_>	 we usually see it 
[15:24:38] <bblack>	 ok
[15:24:46] <_joe_>	 but the time doesn't match a time where execution would've happened
[15:24:48] <_joe_>	 unless
[15:24:54] <bblack>	 unless someone cumin'd it
[15:24:57] <_joe_>	 someone ran puppet in a funny way via cumin
[15:24:59] <bblack>	 which also doesn't result in logs anywhere
[15:25:07] <_joe_>	 well, no, it does
[15:25:15] <icinga-wm>	 RECOVERY - puppet last run on lvs3002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[15:25:21] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs1006 is CRITICAL: CRITICAL: 46 connections established with conf1004.eqiad.wmnet:4001 (min=47) https://wikitech.wikimedia.org/wiki/PyBal
[15:25:21] <_joe_>	 cumin logs would have it
[15:25:27] <_joe_>	 but why not puppet.log
[15:25:34] <bblack>	 anyways, this is all tertiary to the point, I should keep staring at lvs state for now
[15:25:49] <_joe_>	 yeah sorry
[15:26:03] <_joe_>	 it looks like something is wrong on lvs1006?
[15:26:13] <bblack>	 I can't quite figure out why the strange netlink error messages on lvs1016
[15:26:32] <bblack>	 lvs1006 I just did a manual agent run on, and it applied the same-ish change without all the netlink messages
[15:26:32] <wikibugs>	 (03CR) 10Nuria: [C: 03+1] admin: add the analytics system user to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/504839 (https://phabricator.wikimedia.org/T220971) (owner: 10Elukey)
[15:26:39] <cdanis>	 I see a cumin execution of run-puppet-agent on conf* and lvs* at 15:05
[15:26:49] <ottomata>	 that was me
[15:27:25] <icinga-wm>	 RECOVERY - puppet last run on lvs5001 is OK: OK: Puppet is currently enabled, last run 10 minutes ago with 0 failures
[15:27:28] <bblack>	 _joe_: the lvs1006 message from icinga is basically telling us "something new was puppetized into pybal config and pybal hasn't been restarted to make it take effect", I think
[15:27:51] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.43:8190]) https://wikitech.wikimedia.org/wiki/PyBal
[15:28:09] <wikibugs>	 (03PS16) 10CRusnov: Netbox module for Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072)
[15:29:00] <icinga-wm>	 RECOVERY - Host schema.svc.codfw.wmnet is UP: PING OK - Packet loss = 0%, RTA = 36.07 ms
[15:29:33] <bblack>	 anyways, right now I'm just trying to confirm the live state of lvs1006+lvs1016 are sane and identical after puppeting both with the followup fix, then will step through the restarts
[15:29:47] <bblack>	 it's simpler than reverting at this point, assuming nothing's really wrong (nothing else seems to be impacted so far)
[15:30:33] <ottomata>	 ok bblack, am ready to revert if you think that is better, just let me know.  super sorry about this.  that was really dumb and dangerous of me.
[15:30:46] <wikibugs>	 (03PS17) 10CRusnov: Netbox module for Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072)
[15:31:51] <icinga-wm>	 RECOVERY - puppet last run on lvs1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[15:32:19] <bblack>	 !log restarting pybal on lvs1006 (eqiad backup) for eventschema service add
[15:32:21] <wikibugs>	 (03CR) 10CRusnov: "> Patch Set 15:" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov)
[15:32:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:03] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2003 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.43:8190]) https://wikitech.wikimedia.org/wiki/PyBal
[15:33:16] <bblack>	 ugh I guess there was a codfw version too heh
[15:33:25] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs2003 is CRITICAL: CRITICAL: 37 connections established with conf2001.codfw.wmnet:2379 (min=38) https://wikitech.wikimedia.org/wiki/PyBal
[15:33:37] <icinga-wm>	 RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:34:29] <icinga-wm>	 RECOVERY - Host schema.svc.eqiad.wmnet is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[15:34:51] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Netbox module for Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov)
[15:35:25] <icinga-wm>	 RECOVERY - puppet last run on lvs4006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[15:35:43] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs1006 is OK: OK: 47 connections established with conf1004.eqiad.wmnet:4001 (min=47) https://wikitech.wikimedia.org/wiki/PyBal
[15:36:45] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs2006 is CRITICAL: CRITICAL: 37 connections established with conf2001.codfw.wmnet:2379 (min=38) https://wikitech.wikimedia.org/wiki/PyBal
[15:37:05] <icinga-wm>	 RECOVERY - puppet last run on lvs2001 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[15:37:15] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 on schema.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.43 and port 8190: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[15:37:22] <cdanis>	 bblack: can I help?
[15:37:32] <wikibugs>	 (03PS18) 10CRusnov: Netbox module for Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072)
[15:39:37] <bblack>	 !log restart pybal on lvs2006 (codfw backup) for eventschemas service add
[15:39:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:53] <bblack>	 cdanis: I think we're good, I just need to step through this process and clean up the alerts, etc
[15:40:20] <bblack>	 the netlink bit is "mysterious", but doesn't seem to have any practical importance
[15:40:36] <ottomata>	 i added a note here to warn my future self and others: https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service
[15:40:43] <bblack>	 (for all I know they're fairly normal and innocuous when we add new LVS service IPs via wikimedia-lvs-realserver)
[15:42:03] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs2006 is OK: OK: 38 connections established with conf2001.codfw.wmnet:2379 (min=38) https://wikitech.wikimedia.org/wiki/PyBal
[15:42:45] <bblack>	 !log restart pybal on lvs2003 (codfw primary) for eventschemas service add
[15:42:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:43] <bblack>	 the pybal ipvs diff check is still credibly failing for the new service, I'll dig into that a bit more
[15:43:50] <bblack>	 but the etcd thing does clear up on restarts
[15:43:59] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs2003 is OK: OK: 38 connections established with conf2001.codfw.wmnet:2379 (min=38) https://wikitech.wikimedia.org/wiki/PyBal
[15:46:24] <cdanis>	 looking at logs on the rsyslog central servers, we commonly get that message on cloudvirts and kube hosts, and somewhat often on ganeti hosts and boron
[15:49:21] <_joe_>	 bblack: that means no pooled servers are present in conftool probably
[15:49:26] <bblack>	 ah I think the diff check is because nothing is pooled
[15:49:33] <_joe_>	 no servers even in pooled=no state prolly
[15:49:34] <bblack>	 heh joe beat me to it
[15:49:35] <bblack>	 bblack@cumin1001:~$ confctl select name='schema.*' get
[15:49:35] <bblack>	 {"schema2001.codfw.wmnet": {"weight": 0, "pooled": "no"}, "tags": "dc=codfw,cluster=eventschemas,service=eventschemas"}
[15:49:38] <bblack>	 {"schema2002.codfw.wmnet": {"weight": 0, "pooled": "no"}, "tags": "dc=codfw,cluster=eventschemas,service=eventschemas"}
[15:49:41] <bblack>	 {"schema1001.eqiad.wmnet": {"weight": 0, "pooled": "no"}, "tags": "dc=eqiad,cluster=eventschemas,service=eventschemas"}
[15:49:44] <bblack>	 {"schema1002.eqiad.wmnet": {"weight": 0, "pooled": "no"}, "tags": "dc=eqiad,cluster=eventschemas,service=eventschemas"}
[15:49:47] <bblack>	 ottomata: can we pool them?
[15:49:55] <ottomata>	 yes
[15:50:04] <ottomata>	 do they need to be manually pooled?
[15:50:27] <bblack>	 yeah everything by default gets created in depooled states
[15:50:54] <logmsgbot>	 !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=schema.*
[15:50:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:50:59] <_joe_>	 ottomata: you know, usually you want to be able to add a service to conftool, then pool the individual servers. But there is a way to invert that behaviour, by changing the defaults for that service, in conftool-data
[15:51:08] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[15:51:20] <_joe_>	 and pybal picks that up, nice
[15:51:30] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 on schema.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 317 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[15:51:36] <_joe_>	 used to be broken, kudos to all that have worked on it
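
The exchange above found that every schema host was still depooled, because conftool creates new entries in a depooled state by default. As a minimal sketch, assuming the `confctl ... get` output is one JSON object per line exactly as pasted above, the check could be scripted as below. The helper name `depooled_hosts` is hypothetical and is not part of conftool.

```python
import json

# Sample lines copied verbatim from the confctl output pasted in the channel.
CONFCTL_OUTPUT = """\
{"schema2001.codfw.wmnet": {"weight": 0, "pooled": "no"}, "tags": "dc=codfw,cluster=eventschemas,service=eventschemas"}
{"schema2002.codfw.wmnet": {"weight": 0, "pooled": "no"}, "tags": "dc=codfw,cluster=eventschemas,service=eventschemas"}
{"schema1001.eqiad.wmnet": {"weight": 0, "pooled": "no"}, "tags": "dc=eqiad,cluster=eventschemas,service=eventschemas"}
{"schema1002.eqiad.wmnet": {"weight": 0, "pooled": "no"}, "tags": "dc=eqiad,cluster=eventschemas,service=eventschemas"}
"""


def depooled_hosts(output):
    """Return hostnames whose 'pooled' value is anything other than 'yes'.

    Hypothetical helper: assumes each non-empty line is a JSON object with
    one hostname key plus a 'tags' key, as in the sample above.
    """
    hosts = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        for name, state in record.items():
            if name == "tags":
                continue  # metadata string, not a host entry
            if state.get("pooled") != "yes":
                hosts.append(name)
    return hosts


if __name__ == "__main__":
    for host in depooled_hosts(CONFCTL_OUTPUT):
        print(host)
```

With the sample data, all four schema hosts are reported, which matches the state seen before the `set/pooled=yes` action was logged.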
[15:52:20] <wikibugs>	 10Operations, 10DNS, 10Mail, 10Traffic, 10Patch-For-Review: Phabricator SPF record contains internal addressing for phab[12]001 - https://phabricator.wikimedia.org/T221288 (10Krenair) >>! In T221288#5125169, @GedHaywood wrote: > I am the "User in #2019041710004636" mentioned in the first comment and a ve...
[15:52:29] <ottomata>	 nice looks good.   thanks bblack and all
[15:52:39] <ottomata>	 again very sorry for the extra unplanned and frantic work there
[15:52:57] <wikibugs>	 10Operations, 10DNS, 10Mail, 10Traffic, 10Patch-For-Review: Phabricator SPF record contains internal addressing for phab[12]001 - https://phabricator.wikimedia.org/T221288 (10Krenair)
[15:53:34] <wikibugs>	 10Operations, 10DNS, 10Mail, 10Traffic, 10Patch-For-Review: wiki-mail DKIM failing - https://phabricator.wikimedia.org/T221290 (10Krenair)
[15:53:38] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2003 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[15:53:40] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[15:54:07] <logmsgbot>	 !log reedy@deploy1001 Synchronized php-1.34.0-wmf.1/includes/Linker.php: T220767 (duration: 00m 55s)
[15:54:07] <bblack>	 !log restart pybal on lvs1016 (eqiad primary) for eventschemas service add
[15:54:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:54:12] <stashbot>	 T220767: Some special pages are not properly displaying parenthesis and other seperators around user links - https://phabricator.wikimedia.org/T220767
[15:54:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:16] <logmsgbot>	 !log reedy@deploy1001 Synchronized php-1.34.0-wmf.1/includes/logging/LogFormatter.php: T220767 (duration: 00m 53s)
[15:55:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:58:16] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 47 connections established with conf1004.eqiad.wmnet:4001 (min=47) https://wikitech.wikimedia.org/wiki/PyBal
[15:58:53] <bblack>	 all the LVS/eventschemas -related icinga alerts are clear now, we should be good!
[15:59:07] <ottomata>	 ok great.  
[15:59:08] <ottomata>	 thank you.
[15:59:21] <wikibugs>	 (03CR) 10Paladox: [C: 03+1] "Tested locally and got this working, i think we should reduce the time to lets say 5mins or 10mins?" [puppet] - 10https://gerrit.wikimedia.org/r/505218 (https://phabricator.wikimedia.org/T218654) (owner: 10Hashar)
[15:59:24] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[16:02:30] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[16:04:49] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: First version of the kask chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/505263 (https://phabricator.wikimedia.org/T220401)
[16:09:48] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "Let me know what you think. From what I see it dies trying to connect cassandra if it doesn't exist, but that's rather easy to bypass in t" [deployment-charts] - 10https://gerrit.wikimedia.org/r/505263 (https://phabricator.wikimedia.org/T220401) (owner: 10Alexandros Kosiaris)
[16:19:07] <apergos>	 now I get the schema crit page? Now??
[16:19:14] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] wikipedia.org CNAME experiment: 4H CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/505249 (https://phabricator.wikimedia.org/T208263) (owner: 10BBlack)
[16:19:16] <apergos>	 that's really pointless
[16:19:18] <wikibugs>	 (03PS2) 10BBlack: wikipedia.org CNAME experiment: 4H CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/505249 (https://phabricator.wikimedia.org/T208263)
[16:19:29] <wikibugs>	 10Operations, 10DBA, 10MediaWiki-Database, 10MediaWiki-Logging, 10Wikimedia-production-error: Special:Log on commons -- entire web request took longer than 60 seconds and timed out - https://phabricator.wikimedia.org/T221458 (10Aklapper)
[16:20:01] <apergos>	 aand here come the rest (recovery, crit, recovery)
[16:20:33] <bblack>	 !log wikipedia.org CNAME TTLs increase to 4H - https://gerrit.wikimedia.org/r/c/operations/dns/+/505249 - T208263
[16:20:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:38] <stashbot>	 T208263: Refactor public-facing DYNA scheme for primary project hostnames in our DNS - https://phabricator.wikimedia.org/T208263
[16:21:36] <wikibugs>	 (03CR) 10BBlack: "Note - if there becomes a reason to revert this, you'll want to revert the follow-on TTL changes from I9dee96d07a74d8b57b09333305b65922f44" [dns] - 10https://gerrit.wikimedia.org/r/504588 (https://phabricator.wikimedia.org/T208263) (owner: 10BBlack)
[16:23:00] <cdanis>	 fwiw bblack those lldpd messages have been seen on lvs hosts before: https://phabricator.wikimedia.org/P8420 and https://phabricator.wikimedia.org/P8421
[16:23:48] <bblack>	 cdanis: ok, thanks!
[16:27:33] <wikibugs>	 (03PS3) 10Andrew Bogott: wmcs: Respect LDAP locks in ssh-key-ldap-lookup [puppet] - 10https://gerrit.wikimedia.org/r/505025 (https://phabricator.wikimedia.org/T168692) (owner: 10BryanDavis)
[16:29:12] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wmcs: Respect LDAP locks in ssh-key-ldap-lookup [puppet] - 10https://gerrit.wikimedia.org/r/505025 (https://phabricator.wikimedia.org/T168692) (owner: 10BryanDavis)
[16:34:14] <wikibugs>	 (03PS8) 10Bstorm: cloudstore: deploy maps/scratch cluster as nfs::secondary [puppet] - 10https://gerrit.wikimedia.org/r/502342 (https://phabricator.wikimedia.org/T209527)
[16:49:18] <wikibugs>	 (03PS2) 10Krinkle: webperf: Remove arclamp subscriber from mwlog servers [puppet] - 10https://gerrit.wikimedia.org/r/503675 (https://phabricator.wikimedia.org/T195312)
[16:50:46] <wikibugs>	 (03CR) 10Krinkle: "Puppet compiler showing the to-be-removed resources:" [puppet] - 10https://gerrit.wikimedia.org/r/503675 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle)
[16:52:51] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] "I will merge this next week, after merging we will need to" [puppet] - 10https://gerrit.wikimedia.org/r/503675 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle)
[16:55:20] <wikibugs>	 (03PS9) 10Bstorm: cloudstore: deploy maps/scratch cluster as nfs::secondary [puppet] - 10https://gerrit.wikimedia.org/r/502342 (https://phabricator.wikimedia.org/T209527)
[17:02:08] <wikibugs>	 (03PS10) 10Bstorm: cloudstore: deploy maps/scratch cluster as nfs::secondary [puppet] - 10https://gerrit.wikimedia.org/r/502342 (https://phabricator.wikimedia.org/T209527)
[17:05:51] <wikibugs>	 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10dr0ptp4kt) (Detour)  @Nuria the other day I mentioned my project aroun...
[17:19:41] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1003 with 10G interfaces - https://phabricator.wikimedia.org/T221139 (10Cmjohnson)
[17:28:20] <wikibugs>	 10Operations, 10Epic, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1003 with 10G interfaces - https://phabricator.wikimedia.org/T221139 (10Cmjohnson) a:05Cmjohnson→03Andrew Removing ops-eqiad tag and assigning to @Andrew
[17:28:51] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1004 with 10G interfaces - https://phabricator.wikimedia.org/T221138 (10Cmjohnson)
[17:29:19] <wikibugs>	 10Operations, 10Epic, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1004 with 10G interfaces - https://phabricator.wikimedia.org/T221138 (10Cmjohnson) a:05Cmjohnson→03Andrew Removing ops-eqiad tag and assigning to @Andrew
[17:29:52] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1005 with 10G interfaces - https://phabricator.wikimedia.org/T221049 (10Cmjohnson)
[17:30:20] <wikibugs>	 10Operations, 10Epic, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1005 with 10G interfaces - https://phabricator.wikimedia.org/T221049 (10Cmjohnson) a:05Cmjohnson→03Andrew Removing ops-eqiad tag and assigning to @Andrew
[17:30:38] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1006 with 10G interfaces - https://phabricator.wikimedia.org/T221048 (10Cmjohnson)
[17:30:58] <wikibugs>	 10Operations, 10Epic, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1006 with 10G interfaces - https://phabricator.wikimedia.org/T221048 (10Cmjohnson) a:05Cmjohnson→03Andrew Removing ops-eqiad tag and assigning to @Andrew
[17:40:20] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudvirt1007 predicted raid failure - https://phabricator.wikimedia.org/T209861 (10Cmjohnson)
[17:40:34] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudvirt1007 predicted raid failure - https://phabricator.wikimedia.org/T209861 (10Cmjohnson) Created a procurement ticket T221470
[17:41:31] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10Cmjohnson)
[17:41:59] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10Cmjohnson) I created a procurement task T221470
[17:43:07] <wikibugs>	 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1024 with 10G interfaces - https://phabricator.wikimedia.org/T216724 (10Cmjohnson) a:05Cmjohnson→03Andrew Removing ops-eqiad tag and assigning to Andrew for install.   please add tag back if there are any h/w issues
[17:55:39] <wikibugs>	 10Operations, 10serviceops, 10Patch-For-Review: setup/install WMF7426 as phab1003.eqiad.wmnet - https://phabricator.wikimedia.org/T221389 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` phab1003.eqiad.wmnet ` The log can be found in `/var/log/wmf-aut...
[18:10:20] <wikibugs>	 10Operations, 10fundraising-tech-ops, 10netops: Network setup for frmon2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T221475 (10cwdent)
[18:14:18] <wikibugs>	 (03Abandoned) 10Ottomata: WIP Serve event-schemas repo via http [puppet] - 10https://gerrit.wikimedia.org/r/482867 (https://phabricator.wikimedia.org/T211247) (owner: 10Ottomata)
[18:15:20] <icinga-wm>	 PROBLEM - HHVM rendering on mw2143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[18:16:14] <icinga-wm>	 RECOVERY - HHVM rendering on mw2143 is OK: HTTP OK: HTTP/1.1 200 OK - 80329 bytes in 0.313 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[18:21:17] <wikibugs>	 (03PS1) 10CDanis: Merge "add a .gitreview" [software/conftool] - 10https://gerrit.wikimedia.org/r/505279
[18:21:32] <wikibugs>	 (03Abandoned) 10CDanis: Merge "add a .gitreview" [software/conftool] - 10https://gerrit.wikimedia.org/r/505279 (owner: 10CDanis)
[18:21:47] <wikibugs>	 (03PS1) 10Ottomata: Enable cirrussearch-request logs via eventgate in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505280 (https://phabricator.wikimedia.org/T214080)
[18:29:08] <wikibugs>	 10Operations, 10serviceops, 10Patch-For-Review: setup/install WMF7426 as phab1003.eqiad.wmnet - https://phabricator.wikimedia.org/T221389 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['phab1003.eqiad.wmnet'] `  and were **ALL** successful.
[18:29:19] <wikibugs>	 (03PS16) 10CDanis: Add a WMF-specific tool for managing db config in MediaWiki [software/conftool] - 10https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) (owner: 10Giuseppe Lavagetto)
[18:30:13] <wikibugs>	 (03PS17) 10CDanis: Add a WMF-specific tool for managing db config in MediaWiki [software/conftool] - 10https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) (owner: 10Giuseppe Lavagetto)
[18:45:28] <wikibugs>	 (03CR) 10EBernhardson: Enable cirrussearch-request logs via eventgate in beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505280 (https://phabricator.wikimedia.org/T214080) (owner: 10Ottomata)
[18:51:01] <wikibugs>	 (03PS11) 10Bstorm: cloudstore: deploy maps/scratch cluster as nfs::secondary [puppet] - 10https://gerrit.wikimedia.org/r/502342 (https://phabricator.wikimedia.org/T209527)
[18:51:26] <wikibugs>	 (03PS1) 10Ottomata: Import mediawiki.api-request and mediawiki.cirrussearch-request via Camus into Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/505283 (https://phabricator.wikimedia.org/T214080)
[18:51:53] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Import mediawiki.api-request and mediawiki.cirrussearch-request via Camus into Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/505283 (https://phabricator.wikimedia.org/T214080) (owner: 10Ottomata)
[18:53:25] <wikibugs>	 (03PS2) 10Ottomata: Import mediawiki.(api|cirrussearch)-request via Camus into Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/505283 (https://phabricator.wikimedia.org/T214080)
[18:56:18] <wikibugs>	 (03PS1) 10Ottomata: New Refine job to refine events using remote JSONSchemas [puppet] - 10https://gerrit.wikimedia.org/r/505287 (https://phabricator.wikimedia.org/T214080)
[18:57:11] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] New Refine job to refine events using remote JSONSchemas [puppet] - 10https://gerrit.wikimedia.org/r/505287 (https://phabricator.wikimedia.org/T214080) (owner: 10Ottomata)
[18:57:50] <wikibugs>	 (03PS2) 10Ottomata: New Refine job to refine events using remote JSONSchemas [puppet] - 10https://gerrit.wikimedia.org/r/505287 (https://phabricator.wikimedia.org/T214080)
[18:58:27] <wikibugs>	 10Operations, 10Analytics, 10EventBus, 10Core Platform Team (Modern Event Platform (TEC2)), and 3 others: Possibly expand Kafka main-{eqiad,codfw} clusters in Q4 2019. - https://phabricator.wikimedia.org/T217359 (10herron) Today we discussed desired hardware configs and expansion strategies during a meetin...
[18:59:40] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/15914/an-coord1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/505283 (https://phabricator.wikimedia.org/T214080) (owner: 10Ottomata)
[18:59:47] <wikibugs>	 (03PS3) 10Ottomata: Import mediawiki.(api|cirrussearch)-request via Camus into Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/505283 (https://phabricator.wikimedia.org/T214080)
[18:59:50] <wikibugs>	 (03CR) 10Ottomata: [V: 03+2 C: 03+2] Import mediawiki.(api|cirrussearch)-request via Camus into Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/505283 (https://phabricator.wikimedia.org/T214080) (owner: 10Ottomata)
[19:00:17] <wikibugs>	 (03CR) 10Ottomata: "Requires that refinery-source 0.0.86 is deployed" [puppet] - 10https://gerrit.wikimedia.org/r/505287 (https://phabricator.wikimedia.org/T214080) (owner: 10Ottomata)
[19:00:22] <wikibugs>	 (03CR) 10Ottomata: [C: 04-1] "-1 for now." [puppet] - 10https://gerrit.wikimedia.org/r/505287 (https://phabricator.wikimedia.org/T214080) (owner: 10Ottomata)
[19:03:47] <wikibugs>	 (03CR) 10Bstorm: "This is a NOOP for all NFS systems currently in production (see: https://puppet-compiler.wmflabs.org/compiler1002/15913/)" [puppet] - 10https://gerrit.wikimedia.org/r/502342 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm)
[19:06:18] <wikibugs>	 (03PS3) 10Ottomata: New Refine job to refine events using remote JSONSchemas [puppet] - 10https://gerrit.wikimedia.org/r/505287 (https://phabricator.wikimedia.org/T214080)
[19:07:11] <wikibugs>	 (03PS12) 10Bstorm: cloudstore: deploy maps/scratch cluster as nfs::secondary [puppet] - 10https://gerrit.wikimedia.org/r/502342 (https://phabricator.wikimedia.org/T209527)
[19:15:51] <wikibugs>	 10Operations, 10PHP 7.2 support, 10Wikimedia-production-error: Investigate why a string literal changed in opcache (Fatal exception of type "ConfigException") - https://phabricator.wikimedia.org/T221347 (10Krinkle)
[19:15:54] <wikibugs>	 (03PS13) 10Bstorm: cloudstore: deploy maps/scratch cluster as nfs::secondary [puppet] - 10https://gerrit.wikimedia.org/r/502342 (https://phabricator.wikimedia.org/T209527)
[19:17:40] <wikibugs>	 (03PS14) 10Bstorm: cloudstore: deploy maps/scratch cluster as nfs::secondary [puppet] - 10https://gerrit.wikimedia.org/r/502342 (https://phabricator.wikimedia.org/T209527)
[19:21:19] <wikibugs>	 (03PS2) 10EBernhardson: Enable cirrussearch-request logs via eventgate in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505280 (https://phabricator.wikimedia.org/T214080) (owner: 10Ottomata)
[19:22:30] <wikibugs>	 (03CR) 10Ottomata: Enable cirrussearch-request logs via eventgate in beta (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505280 (https://phabricator.wikimedia.org/T214080) (owner: 10Ottomata)
[19:23:41] <wikibugs>	 (03PS15) 10Bstorm: cloudstore: deploy maps/scratch cluster as nfs::secondary [puppet] - 10https://gerrit.wikimedia.org/r/502342 (https://phabricator.wikimedia.org/T209527)
[19:24:23] <wikibugs>	 (03CR) 10Bstorm: "Seems legit https://puppet-compiler.wmflabs.org/compiler1002/15917/cloudstore1008.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/502342 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm)
[19:25:25] <wikibugs>	 (03PS16) 10Bstorm: cloudstore: deploy maps/scratch cluster as nfs::secondary [puppet] - 10https://gerrit.wikimedia.org/r/502342 (https://phabricator.wikimedia.org/T209527)
[19:27:02] <wikibugs>	 (03PS3) 10EBernhardson: Enable cirrussearch-request logs via eventgate in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505280 (https://phabricator.wikimedia.org/T214080) (owner: 10Ottomata)
[19:27:57] <wikibugs>	 (03PS17) 10Bstorm: cloudstore: deploy maps/scratch cluster as nfs::secondary [puppet] - 10https://gerrit.wikimedia.org/r/502342 (https://phabricator.wikimedia.org/T209527)
[19:28:23] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Enable cirrussearch-request logs via eventgate in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505280 (https://phabricator.wikimedia.org/T214080) (owner: 10Ottomata)
[19:29:46] <wikibugs>	 (03CR) 10jenkins-bot: Enable cirrussearch-request logs via eventgate in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505280 (https://phabricator.wikimedia.org/T214080) (owner: 10Ottomata)
[19:31:30] <wikibugs>	 10Operations, 10serviceops, 10Patch-For-Review: setup/install WMF7426 as phab1003.eqiad.wmnet - https://phabricator.wikimedia.org/T221389 (10Dzahn) https://wikitech.wikimedia.org/wiki/Phab1003  https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/phab1001.eqiad.wmnet
[19:32:23] <wikibugs>	 (03PS18) 10Bstorm: cloudstore: deploy maps/scratch cluster as nfs::secondary [puppet] - 10https://gerrit.wikimedia.org/r/502342 (https://phabricator.wikimedia.org/T209527)
[19:33:35] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] cloudstore: deploy maps/scratch cluster as nfs::secondary [puppet] - 10https://gerrit.wikimedia.org/r/502342 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm)
[19:34:03] <wikibugs>	 10Operations, 10serviceops, 10Patch-For-Review: setup/install WMF7426 as phab1003.eqiad.wmnet - https://phabricator.wikimedia.org/T221389 (10Dzahn)
[19:35:17] <wikibugs>	 10Operations, 10serviceops, 10Patch-For-Review: setup/install WMF7426 as phab1003.eqiad.wmnet - https://phabricator.wikimedia.org/T221389 (10Dzahn)
[19:36:28] <logmsgbot>	 !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: No-op - prep for enabling cirrussearch-request logging in beta (duration: 00m 53s)
[19:36:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:37:41] <logmsgbot>	 !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: No-op - enabling cirrussearch-request logging in beta (duration: 00m 53s)
[19:38:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:49] <logmsgbot>	 !log otto@deploy1001 Synchronized wmf-config/CirrusSearch-common.php: No-op - enabling cirrussearch-request logging in beta (duration: 00m 52s)
[19:39:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:41:52] <wikibugs>	 (03PS1) 10Dzahn: site: turn phab1003 into a (but not the prod) phabricator server [puppet] - 10https://gerrit.wikimedia.org/r/505298 (https://phabricator.wikimedia.org/T221391)
[19:42:41] <wikibugs>	 (03CR) 10Smalyshev: [C: 04-1] "Hold on this - https://phabricator.wikimedia.org/T221407 seems to happen on on internal hosts, so I am suspecting maybe it's related. Unti" [puppet] - 10https://gerrit.wikimedia.org/r/504990 (https://phabricator.wikimedia.org/T217897) (owner: 10Smalyshev)
[19:45:18] <wikibugs>	 (03PS2) 10Dzahn: site: turn phab1003 into a (but not the prod) phabricator server [puppet] - 10https://gerrit.wikimedia.org/r/505298 (https://phabricator.wikimedia.org/T221391)
[19:46:30] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] site: turn phab1003 into a (but not the prod) phabricator server [puppet] - 10https://gerrit.wikimedia.org/r/505298 (https://phabricator.wikimedia.org/T221391) (owner: 10Dzahn)
[20:02:09] <wikibugs>	 10Operations, 10serviceops, 10Patch-For-Review: setup/install WMF7426 as phab1003.eqiad.wmnet - https://phabricator.wikimedia.org/T221389 (10Dzahn)
[20:07:16] <icinga-wm>	 PROBLEM - Disk space on furud is CRITICAL: DISK CRITICAL - /mnt/hdfs is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[20:08:45] <wikibugs>	 (03PS1) 10Dzahn: base/monitoring: add Icinga notes_url for 'long running screens' alert [puppet] - 10https://gerrit.wikimedia.org/r/505301 (https://phabricator.wikimedia.org/T197873)
[20:09:20] <mutante>	 ottomata: furud ^ seems out of disk. site.pp says it's a hadoop client and backup and to ask you
[20:10:09] <mutante>	 eh, well. mount not accessible
[20:14:54] <wikibugs>	 (03PS1) 10Dzahn: icinga: remove google safe browsing monitoring [puppet] - 10https://gerrit.wikimedia.org/r/505304 (https://phabricator.wikimedia.org/T216985)
[20:15:42] <wikibugs>	 (03PS2) 10Dzahn: icinga: remove google safe browsing monitoring [puppet] - 10https://gerrit.wikimedia.org/r/505304 (https://phabricator.wikimedia.org/T216985)
[20:18:55] <wikibugs>	 (03CR) 10Dzahn: "Ariel: see https://phabricator.wikimedia.org/T30898" [puppet] - 10https://gerrit.wikimedia.org/r/505304 (https://phabricator.wikimedia.org/T216985) (owner: 10Dzahn)
[20:19:30] <wikibugs>	 (03PS3) 10Dzahn: icinga: remove google safe browsing monitoring [puppet] - 10https://gerrit.wikimedia.org/r/505304 (https://phabricator.wikimedia.org/T216985)
[20:21:27] <icinga-wm>	 PROBLEM - Mediawiki Cirrussearch update lag - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[20:21:33] <icinga-wm>	 PROBLEM - Mediawiki Cirrussearch update lag - eqiad on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[20:21:48] <wikibugs>	 (03PS3) 10Alex Monk: deployment-prep: Add stretch storage hosts [software/swift-ring] - 10https://gerrit.wikimedia.org/r/503714
[20:24:51] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] icinga: remove google safe browsing monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/505304 (https://phabricator.wikimedia.org/T216985) (owner: 10Dzahn)
[20:36:47] <wikibugs>	 (03CR) 10Clarakosi: "> Patch Set 1:" (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/505263 (https://phabricator.wikimedia.org/T220401) (owner: 10Alexandros Kosiaris)
[20:42:19] <icinga-wm>	 RECOVERY - Mediawiki Cirrussearch update lag - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[20:42:25] <icinga-wm>	 RECOVERY - Mediawiki Cirrussearch update lag - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[20:44:22] <wikibugs>	 (03PS18) 10CDanis: Add a WMF-specific tool for managing db config in MediaWiki [software/conftool] - 10https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) (owner: 10Giuseppe Lavagetto)
[20:45:30] <wikibugs>	 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (10Smalyshev) Results of caching can be seen here: https://grafana.wikimedia.org/d/000000489/wikidata-query-service?...
[20:53:04] <wikibugs>	 10Operations, 10Operations-Software-Development, 10User-Joe, 10User-jijiki: Spicerack cookbooks TODO list - https://phabricator.wikimedia.org/T203943 (10debt)
[20:53:14] <wikibugs>	 10Operations, 10Operations-Software-Development, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Create a cookbook to copy data between WDQS servers - https://phabricator.wikimedia.org/T213401 (10debt) 05Open→03Resolved
[20:57:45] <wikibugs>	 (03PS4) 10Dzahn: icinga: remove google safe browsing monitoring [puppet] - 10https://gerrit.wikimedia.org/r/505304 (https://phabricator.wikimedia.org/T216985)
[20:57:47] <wikibugs>	 (03CR) 10Dzahn: icinga: remove google safe browsing monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/505304 (https://phabricator.wikimedia.org/T216985) (owner: 10Dzahn)
[21:04:01] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] icinga: remove google safe browsing monitoring [puppet] - 10https://gerrit.wikimedia.org/r/505304 (https://phabricator.wikimedia.org/T216985) (owner: 10Dzahn)
[21:08:07] <wikibugs>	 (03PS4) 10Dzahn: site/conftool: assign mw2150 jobrunner, mw2244,mw2245 API servers [puppet] - 10https://gerrit.wikimedia.org/r/504794 (https://phabricator.wikimedia.org/T192457)
[21:19:33] <wikibugs>	 (03PS1) 10Dzahn: add fake certs for mw2150,mw2244,mw2245 [labs/private] - 10https://gerrit.wikimedia.org/r/505315 (https://phabricator.wikimedia.org/T192457)
[21:21:17] <wikibugs>	 (03PS2) 10Dzahn: add fake certs for mw2150,mw2244,mw2245 [labs/private] - 10https://gerrit.wikimedia.org/r/505315 (https://phabricator.wikimedia.org/T192457)
[21:23:13] <wikibugs>	 (03PS3) 10Dzahn: add fake certs for mw2150,mw2244,mw2245 [labs/private] - 10https://gerrit.wikimedia.org/r/505315 (https://phabricator.wikimedia.org/T192457)
[21:23:55] <wikibugs>	 (03CR) 10Dzahn: [V: 03+2 C: 03+2] add fake certs for mw2150,mw2244,mw2245 [labs/private] - 10https://gerrit.wikimedia.org/r/505315 (https://phabricator.wikimedia.org/T192457) (owner: 10Dzahn)
[21:24:01] <wikibugs>	 (03PS4) 10Dzahn: add fake certs for mw2150,mw2244,mw2245 [labs/private] - 10https://gerrit.wikimedia.org/r/505315 (https://phabricator.wikimedia.org/T192457)
[21:24:46] <wikibugs>	 (03CR) 10Dzahn: [V: 03+2 C: 03+2] add fake certs for mw2150,mw2244,mw2245 [labs/private] - 10https://gerrit.wikimedia.org/r/505315 (https://phabricator.wikimedia.org/T192457) (owner: 10Dzahn)
[21:29:47] <icinga-wm>	 ACKNOWLEDGEMENT - HP RAID on db2047 is CRITICAL: CRITICAL: Slot 0: Predictive Failure: 1I:1:1 - OK: 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11 - Failed: 1I:1:12 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T221481
[21:29:52] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on db2047 - https://phabricator.wikimedia.org/T221481 (10ops-monitoring-bot)
[21:35:47] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 64.29% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[21:38:30] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "thanks for making the certs in prod! compiler run works now after adding fake secrets in labs/private. affects only the new hosts, not oth" [puppet] - 10https://gerrit.wikimedia.org/r/504794 (https://phabricator.wikimedia.org/T192457) (owner: 10Dzahn)
[21:42:52] <mutante>	 !log mw2150,mw2244,mw2245: initial puppet run, added to mw roles
[21:42:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:46:11] <wikibugs>	 (03PS2) 10Dzahn: base/monitoring: add Icinga notes_url for 'long running screens' alert [puppet] - 10https://gerrit.wikimedia.org/r/505301 (https://phabricator.wikimedia.org/T197873)
[21:47:20] <wikibugs>	 (03PS3) 10Dzahn: base/monitoring: add Icinga notes_url for 'long running screens' alert [puppet] - 10https://gerrit.wikimedia.org/r/505301 (https://phabricator.wikimedia.org/T197873)
[21:47:54] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] base/monitoring: add Icinga notes_url for 'long running screens' alert [puppet] - 10https://gerrit.wikimedia.org/r/505301 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn)
[21:49:02] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348 (10Dzahn) created kind of a runbook at  https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens  linked that as notes_url  https://gerrit.wikimedia.org/r/c/operation...
[21:49:57] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[21:51:16] <wikibugs>	 (03Abandoned) 10Dzahn: replace phab1002 with phab1003 [puppet] - 10https://gerrit.wikimedia.org/r/496119 (https://phabricator.wikimedia.org/T215335) (owner: 10Dzahn)
[21:51:49] <wikibugs>	 (03PS1) 10Bstorm: cloudstore: fix up some params around the rsync jobs [puppet] - 10https://gerrit.wikimedia.org/r/505325 (https://phabricator.wikimedia.org/T209527)
[21:54:38] <wikibugs>	 (03PS2) 10Bstorm: cloudstore: fix up some params around the rsync jobs [puppet] - 10https://gerrit.wikimedia.org/r/505325 (https://phabricator.wikimedia.org/T209527)
[21:55:01] <wikibugs>	 (03Abandoned) 10Dzahn: set phab1002 as a spare::system [puppet] - 10https://gerrit.wikimedia.org/r/496116 (https://phabricator.wikimedia.org/T215332) (owner: 10Dzahn)
[21:56:32] <wikibugs>	 (03PS3) 10Dzahn: update mariadb grants from phab1002 to phab1003 [puppet] - 10https://gerrit.wikimedia.org/r/496120
[21:56:55] <wikibugs>	 (03PS4) 10Dzahn: update mariadb grants from phab1002 to phab1003 [puppet] - 10https://gerrit.wikimedia.org/r/496120
[21:56:57] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] update mariadb grants from phab1002 to phab1003 [puppet] - 10https://gerrit.wikimedia.org/r/496120 (owner: 10Dzahn)
[21:57:08] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] update mariadb grants from phab1002 to phab1003 [puppet] - 10https://gerrit.wikimedia.org/r/496120 (owner: 10Dzahn)
[21:57:50] <wikibugs>	 (03PS5) 10Dzahn: update mariadb grants from phab1002 to phab1003 [puppet] - 10https://gerrit.wikimedia.org/r/496120
[21:58:57] <wikibugs>	 (03PS6) 10Dzahn: update mariadb grants from phab1002 to phab1003 (comments only) [puppet] - 10https://gerrit.wikimedia.org/r/496120
[22:03:26] <wikibugs>	 (03PS8) 10Dzahn: dumps: switch phab1001->phab1003 as phab dumps source [puppet] - 10https://gerrit.wikimedia.org/r/437558 (https://phabricator.wikimedia.org/T196019)
[22:09:10] <wikibugs>	 10Operations, 10PHP 7.2 support, 10Wikimedia-production-error: Investigate why a string literal changed in opcache (Fatal exception of type "ConfigException") - https://phabricator.wikimedia.org/T221347 (10Urbanecm) Shouldn't prio be lowered? If I'm not mistaken, we're in post mortem stage - which can last l...
[22:12:01] <wikibugs>	 (03CR) 10Paladox: dumps: switch phab1001->phab1003 as phab dumps source (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/437558 (https://phabricator.wikimedia.org/T196019) (owner: 10Dzahn)
[22:12:04] <wikibugs>	 (03PS3) 10Dzahn: switch phabricator from phab1001 to phab1003 [puppet] - 10https://gerrit.wikimedia.org/r/437620 (https://phabricator.wikimedia.org/T196019)
[22:12:25] <wikibugs>	 (03CR) 10Dzahn: switch phabricator from phab1001 to phab1003 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/437620 (https://phabricator.wikimedia.org/T196019) (owner: 10Dzahn)
[22:15:19] <wikibugs>	 (03PS9) 10Dzahn: dumps: switch phab1001->phab1003 as phab dumps source [puppet] - 10https://gerrit.wikimedia.org/r/437558 (https://phabricator.wikimedia.org/T196019)
[22:17:00] <icinga-wm>	 PROBLEM - Check systemd state on mw2244 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[22:17:26] <icinga-wm>	 PROBLEM - Check systemd state on mw2245 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[22:17:36] <mutante>	 ^ first puppet run, not pooled
[22:17:42] <chaomodus>	 oic
[22:17:51] <chaomodus>	 was /just about/ to check :)
[22:17:56] <mutante>	 they just ran puppet the first time with the mw role
[22:18:11] <mutante>	 2244,2245 and 2251 .. if any others then that is bad :)
[22:18:24] <chaomodus>	 kay cool
[22:18:25] <mutante>	 sorry, 2151, heh
[22:18:39] <chaomodus>	 so the second digit is like a generational indicator huh
[22:19:06] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[22:19:43] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on mw2150 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T192457
[22:19:43] <icinga-wm>	 ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2150 is CRITICAL: Host mw2150 is not in mediawiki-installation dsh group daniel_zahn https://phabricator.wikimedia.org/T192457 https://wikitech.wikimedia.org/wiki/Application_servers%23Apache_setup_checklist
[22:19:43] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on mw2244 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T192457
[22:19:43] <icinga-wm>	 ACKNOWLEDGEMENT - DPKG on mw2244 is CRITICAL: DPKG CRITICAL dpkg reports broken packages daniel_zahn https://phabricator.wikimedia.org/T192457
[22:19:43] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on mw2245 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T192457
[22:20:04] <mutante>	 chaomodus: yes, first digit is data center and from there it's just counting up. so yes
[22:20:18] <chaomodus>	 mmhm
[22:20:26] <icinga-wm>	 PROBLEM - HHVM rendering on mw2245 is CRITICAL: connect to address 10.192.0.71 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[22:20:50] <mutante>	 chaomodus: btw, icinga.. if you see "long running screen" ones.. i just wrote a page for that  https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[22:21:04] <chaomodus>	 Ah cool
[22:21:19] <mutante>	 there is one on netmon1002 :)
[22:22:03] <chaomodus>	 haha yah
[22:22:16] <icinga-wm>	 PROBLEM - nutcracker port on mw2150 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker
[22:22:20] <icinga-wm>	 ACKNOWLEDGEMENT - nutcracker port on mw2150 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused daniel_zahn https://phabricator.wikimedia.org/T192457 https://wikitech.wikimedia.org/wiki/Nutcracker
[22:22:20] <icinga-wm>	 ACKNOWLEDGEMENT - HHVM rendering on mw2245 is CRITICAL: connect to address 10.192.0.71 and port 80: Connection refused daniel_zahn https://phabricator.wikimedia.org/T192457 https://wikitech.wikimedia.org/wiki/Application_servers
[22:22:48] <chaomodus>	 there were *two*
[22:22:52] <chaomodus>	 fixed :)
[22:23:07] <mutante>	 :) thx
[22:24:10] <icinga-wm>	 PROBLEM - nutcracker process on mw2150 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker
[22:24:35] <mutante>	 setting downtime for more hosts that are in "pending"
[22:24:41] <mutante>	 and would otherwise keep popping up soon
[22:24:48] <mutante>	 eh. services on those hosts 
[22:25:01] <wikibugs>	 (03CR) 10Paladox: [C: 03+1] switch phabricator from phab1001 to phab1003 [puppet] - 10https://gerrit.wikimedia.org/r/437620 (https://phabricator.wikimedia.org/T196019) (owner: 10Dzahn)
[22:25:12] <wikibugs>	 (03CR) 10Paladox: [C: 03+1] dumps: switch phab1001->phab1003 as phab dumps source [puppet] - 10https://gerrit.wikimedia.org/r/437558 (https://phabricator.wikimedia.org/T196019) (owner: 10Dzahn)
[22:25:35] <icinga-wm>	 ACKNOWLEDGEMENT - nutcracker process on mw2150 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (nutcracker), command name nutcracker daniel_zahn https://phabricator.wikimedia.org/T192457 https://wikitech.wikimedia.org/wiki/Nutcracker
[22:26:20] <mutante>	 chaomodus: recovery for "long running screen" isn't expected until hours later.. it checks only every couple hours.. but i am just clicking it
[22:26:46] <icinga-wm>	 RECOVERY - Long running screen/tmux on netmon1002 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[22:26:56] <chaomodus>	 👍
[22:26:58] <mutante>	 paladox: thanks
[22:27:07] <paladox>	 you're welcome :)
[22:29:08] <icinga-wm>	 RECOVERY - HHVM rendering on mw2245 is OK: HTTP OK: HTTP/1.1 200 OK - 80391 bytes in 0.409 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[22:29:41] <wikibugs>	 (03PS3) 10Bstorm: cloudstore: fix up some params around the rsync jobs [puppet] - 10https://gerrit.wikimedia.org/r/505325 (https://phabricator.wikimedia.org/T209527)
[22:29:43] <wikibugs>	 (03PS1) 10Dzahn: update SPF records from phab1001 to phab1003 IP [dns] - 10https://gerrit.wikimedia.org/r/505332 (https://phabricator.wikimedia.org/T221391)
[22:30:51] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] cloudstore: fix up some params around the rsync jobs [puppet] - 10https://gerrit.wikimedia.org/r/505325 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm)
[22:32:14] <wikibugs>	 10Operations, 10serviceops, 10Patch-For-Review: setup/install WMF7426 as phab1003.eqiad.wmnet - https://phabricator.wikimedia.org/T221389 (10Dzahn) note:  phab1001 has more IPs on the interface than phab1003, adding the additional ones doesn't look puppetized !!  cc; @20after4  we need to remember this for m...
[22:32:37] <mutante>	 paladox: https://phabricator.wikimedia.org/T221389#5126343  :p
[22:32:56] <wikibugs>	 10Operations, 10serviceops, 10Patch-For-Review: setup/install WMF7426 as phab1003.eqiad.wmnet - https://phabricator.wikimedia.org/T221389 (10greg)
[22:32:58] <paladox>	 heh
[22:33:19] <mutante>	 3 servers. one has 2, one has 3 and one has 4
[22:33:21] <wikibugs>	 10Operations, 10serviceops, 10Patch-For-Review: setup/install WMF7426 as phab1003.eqiad.wmnet - https://phabricator.wikimedia.org/T221389 (10greg) >>! In T221389#5126343, @Dzahn wrote: > note:  phab1001 has more IPs on the interface than phab1003, adding the additional ones doesn't look puppetized !! >  > cc...
[22:33:44] <paladox>	 mutante 20after4 is mmodell on phab :) (you may want to ping mmodell instead of 20after4)
[22:34:01] <mutante>	 paladox: thanks! greg just fixed the same thing for me
[22:34:17] <paladox>	 yup
[22:34:43] <greg-g>	 personal vs work account, yeah :)
[22:34:55] <wikibugs>	 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2047 - https://phabricator.wikimedia.org/T221481 (10colewhite) p:05Triage→03High
[22:35:10] <wikibugs>	 (03PS1) 10Bstorm: cloudstore: change version to newton for cloudstore1008/9 [puppet] - 10https://gerrit.wikimedia.org/r/505333 (https://phabricator.wikimedia.org/T209527)
[22:35:54] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] cloudstore: change version to newton for cloudstore1008/9 [puppet] - 10https://gerrit.wikimedia.org/r/505333 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm)
[22:39:30] <wikibugs>	 (03PS2) 10Dzahn: update SPF records from phab1001 to phab1003 IP [dns] - 10https://gerrit.wikimedia.org/r/505332 (https://phabricator.wikimedia.org/T221389)
[22:42:12] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[22:44:39] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Cluster: furud - DISK CRITICAL - /mnt/hdfs is not accessible: Input/output error - https://phabricator.wikimedia.org/T221483 (10Dzahn)
[22:46:04] <icinga-wm>	 RECOVERY - Disk space on furud is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[22:47:11] <mutante>	 !log furud - remounted /mnt/hdfs for T221483
[22:47:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:47:17] <stashbot>	 T221483: furud - DISK CRITICAL - /mnt/hdfs is not accessible: Input/output error - https://phabricator.wikimedia.org/T221483
[22:47:25] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Cluster: furud - DISK CRITICAL - /mnt/hdfs is not accessible: Input/output error - https://phabricator.wikimedia.org/T221483 (10Dzahn) 05Open→03Resolved a:03Dzahn followed the docs at https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administrat...
[22:53:06] <mutante>	 !log mw2244,mw2245,mw2150 - rebooting for known nutcracker issue after first install 
[22:53:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:53:14] <wikibugs>	 (03CR) 10BryanDavis: "> It looks right to me.  Another set of eyes wouldn't hurt, but I'd" [puppet] - 10https://gerrit.wikimedia.org/r/504817 (https://phabricator.wikimedia.org/T221225) (owner: 10BryanDavis)
[22:53:19] <icinga-wm>	 RECOVERY - Check systemd state on mw2244 is OK: OK - running: The system is fully operational
[22:54:43] <wikibugs>	 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) 17:42 < mutante> !log mw2150,mw2244,mw2245: initial puppet run, added to mw roles  18:53 < mutante> !log mw2244,mw2245,mw2150 - rebooting for known nutcracker issue after first install    18:...
[22:55:13] <icinga-wm>	 RECOVERY - Check systemd state on mw2245 is OK: OK - running: The system is fully operational
[22:55:25] <icinga-wm>	 RECOVERY - nutcracker process on mw2150 is OK: PROCS OK: 1 process with UID = 112 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker
[22:55:33] <mutante>	 !log mw2244,mw2245,mw2150 - scap pull
[22:55:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:55:45] <icinga-wm>	 RECOVERY - nutcracker port on mw2150 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 https://wikitech.wikimedia.org/wiki/Nutcracker
[23:09:01] <wikibugs>	 10Operations, 10RESTBase, 10RESTBase-API, 10serviceops, and 4 others: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218 (10mobrovac)
[23:09:13] <wikibugs>	 (03PS1) 10Bstorm: cloudstore: try setting the openstack version differently [puppet] - 10https://gerrit.wikimedia.org/r/505339 (https://phabricator.wikimedia.org/T209527)
[23:10:30] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2150.codfw.wmnet,service=nginx,cluster=jobrunner
[23:10:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:16:34] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2244.codfw.wmnet,cluster=api_appserver
[23:16:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:17:07] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2245.codfw.wmnet,cluster=api_appserver
[23:17:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:17:36] <wikibugs>	 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn)
[23:18:28] <wikibugs>	 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) 19:10 <+logmsgbot> !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2150.codfw.wmnet,service=nginx,cluster=jobrunner  19:16 <+logmsgbot> !log dzahn@cumin1001 conftool a...
[23:18:49] <wikibugs>	 (03PS2) 10Bstorm: cloudstore: try setting the openstack version differently [puppet] - 10https://gerrit.wikimedia.org/r/505339 (https://phabricator.wikimedia.org/T209527)
[23:26:10] <wikibugs>	 10Operations, 10DC-Ops, 10Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10Dzahn) Hi @wiki_willy to move forward with your production shell access please create a SSH key.  See https://wikitech.wikimedia.org/wiki/Production_shell_access#Generating_your_SSH_key where you c...
[23:26:28] <wikibugs>	 (03PS3) 10Bstorm: cloudstore: add python3 clientpackages for stretch [puppet] - 10https://gerrit.wikimedia.org/r/505339 (https://phabricator.wikimedia.org/T209527)
[23:27:57] <wikibugs>	 10Operations, 10DC-Ops, 10Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10Dzahn)
[23:29:59] <wikibugs>	 10Operations, 10DC-Ops, 10Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10Dzahn) - resolved "Phabricator permissions to see NDA and Ops restricted tickets".   https://phabricator.wikimedia.org/project/profile/29/  was already done by somebody else  https://phabricator.wi...
[23:31:01] <wikibugs>	 (03CR) 10Bstorm: "https://puppet-compiler.wmflabs.org/compiler1002/15923/cloudstore1008.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/505339 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm)
[23:33:12] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: Change ownership of wikimania-program@lists.wikimedia.org - https://phabricator.wikimedia.org/T220641 (10Dzahn) 05Open→03Resolved You should have received the new password a week ago. If not for some reason, please simply reopen the ticket.