[00:00:04] 10Operations, 10cloud-services-team, 10serviceops, 10Core Platform Team Legacy (Watching / External), and 4 others: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (10Reedy) [00:00:51] mutante, yeah. I was told I could do it myself to speed it up, but it doesn't seem I can just test puppet on my local machine. [00:00:56] viz_: or is it more and you want your own puppet master [00:01:03] with all the wm env and stuff. [00:01:18] viz_: so there are 2 things here.. production access and testing things [00:01:34] maybe in the future, for now I'm just trying to fix that issue. [00:01:36] do you think you want to submit more puppet changes in the future [00:01:40] or is it just about the fonts [00:01:45] ok [00:02:08] so requesting production access for that is overkill [00:02:20] you just need to get somebody to merge it [00:02:36] and for that it needs some +1 on code review [00:02:49] then the person who hits +2 will also run the puppet-merge command afterwards [00:03:05] I see. So I don't have to run puppet-merge. 
[00:03:10] no, you don't [00:03:32] and if you wanted to test more you could have your own puppet master in a test environment [00:03:39] which you can totally have if you want to [00:03:50] but you probably don't need just to add some font packages to a list [00:03:53] if that's what it is [00:04:02] I was initially a bit confused, because I thought there is already a test environment in production puppet [00:04:15] there is "wmcs" which used to be called "labs" [00:04:22] you can use VMs there and apply puppet roles to them [00:04:38] and there is "beta" which is a special project in that for testing mw config [00:04:39] So people can just request access to these puppets and merge changes there to test it out [00:05:05] you cherry pick changes, not merging into the public repo [00:05:16] yea, they can either apply roles that are on the production puppet master or they can install their own local puppet master because then they don't have to wait for code review to test [00:05:38] it's different whether we are talking about puppet changes or mediawiki config changes [00:07:08] if it's about installing Debian packages it's in puppet, if it's about mediawiki-config to use them then it's a separate repository [00:11:13] It appears for that specific font [00:11:19] we may need to backport some package [00:11:37] I have zero clue how to do that. [00:11:43] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:11:47] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:11:49] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:11:51] viztor_: just focus on getting reviewers to +1 your change. 
for the package and backport question i highly recommend adding Muehlenhoff [00:11:51] from Buster [00:12:04] XioNoX: ^ [00:12:09] PROBLEM - OSPF status on cr2-knams is CRITICAL: OSPFv2: 3/5 UP : OSPFv3: 3/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:12:11] worrying at all? [00:13:21] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:13:23] viztor_: he was on the ticket before, ping him again [00:13:27] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:13:33] alright @ recoveries [00:13:49] I do hope there is a way to test puppet changes though. Is there a script/tutorial that can just set up the environment? [00:15:01] PROBLEM - BFD status on cr2-knams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:15:25] RECOVERY - OSPF status on cr2-knams is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:16:27] 10Operations, 10SRE-Access-Requests: Requesting access to Puppet for Viztor[S] - https://phabricator.wikimedia.org/T229894 (10Viztor) >>! In T229894#5398158, @Dzahn wrote: > Hi @Viztor is the reason for requesting this just that you want a change deployed for T226633? (It seems we need to improve on how to get... 
[00:16:39] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:16:39] RECOVERY - BFD status on cr2-knams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:16:49] viztor_: there is lint-checking (easy to install locally from puppet-lint), then there is running CI locally (there is a script for it, requires local Docker), there is "puppet apply" (requires local puppet) and last but not least there is "run your own puppetmaster in a VM" https://wikitech.wikimedia.org/wiki/Help:Standalone_puppetmaster [00:16:54] https://wikitech.wikimedia.org/wiki/Puppet_coding#Testing_a_patch [00:17:49] viztor_: oh, and of course also https://wikitech.wikimedia.org/wiki/Nova_Resource:Puppet-diffs [00:21:52] !log gerrit2001 - restarting gerrit to apply 528276 [00:21:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:22:44] It seems most of these, besides lint-checking of course, require some access level? [00:23:14] viztor_: you can add special tags to the commit message and make a bot compile it on the "puppet-compiler" [00:23:28] by commenting "check experimental" on gerrit [00:24:54] viztor_: running CI locally in docker and running puppet apply locally don't require access, they require installing software though [00:25:26] the access needed for running a master in a VM is requesting a user on wikitech.wikimedia.org wiki [00:25:41] i mean.. 
making one and then requesting access to a project [00:26:17] the advantage of that is it does not require any local software [00:27:17] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/527922#message-60a04245c7bf887ae2507e0d90ae2fe3aa533cdb It appears jenkins can only be triggered by white-listed users [00:27:59] viztor_: yea, that's different per repo though, there is a special group, you can request to be added [00:28:07] paladox: ^ [00:29:03] mutante: I'm on pto until tomorrow but around if something really bad happens [00:29:48] XioNoX: ok! got you, it looks like it was really short and recovered [00:29:52] thanks [00:29:53] (03PS4) 10MSantos: First version of the wikifeeds chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/526679 (https://phabricator.wikimedia.org/T229287) [00:31:08] Mutante: the whitelist is in integration/config (zuul/layout.yaml) [00:32:01] paladox: thanks! [00:32:26] i wonder if we have a page describing how to request it [00:33:05] I’m not sure if there’s a page on that [00:33:44] viztor_: ^ i think we should just re-purpose your access request. remove "puppetmaster in prod" and add "trusted users in gerrit for CI" [00:33:54] then we go from there? ok? [00:34:47] Yeah. That would be a good starting point [00:35:24] Prod-access would be overkill for package installation like that. [00:35:56] yes, it would. but you can look at getting a Wikitech user if you are interested [00:36:16] I think I already have Wikitech access? [00:36:31] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 133.9 ge 130 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [00:36:31] I originally thought there is a production-like environment to debug in. 
[00:36:38] then the next step is requesting access to a project inside wmcs [00:36:41] existing or new one [00:36:48] *staging [00:36:53] yea, there is [00:36:56] Which project should I request access to? [00:37:01] it depends [00:37:07] what exactly you want to test [00:37:17] Just the font changes really. [00:37:30] probably the one called "puppet" [00:37:41] or your own one to then delete it again [00:39:05] so many projects.. [00:39:24] hehe, indeed. that's why wmcs people are also always trying to find stuff to delete [00:39:46] it's fine to create an instance for one thing and then just destroy it again [00:39:56] better than keeping it around for a long time [00:41:12] viztor_: for details on that then see #wikimedia-cloud [00:48:33] !log restarting gerrit to apply config change 528276 to exclude some projects from github replication [00:48:36] jouncebot: next [00:48:36] In 10 hour(s) and 11 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190807T1100) [00:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:48:49] thcipriani: ^ [00:49:17] done [00:49:26] paladox: [00:51:31] re-scheduled icinga checks related to that [00:53:29] PROBLEM - puppet last run on schema2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [00:54:22] (03PS2) 10Dzahn: mediawiki:maintenance: switch translationnotifications to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/528296 (https://phabricator.wikimedia.org/T195392) [00:55:10] known, will recover now [00:58:08] Mutante: thanks! [00:59:03] RECOVERY - puppet last run on schema2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [01:02:06] paladox: yw. 
good night [01:02:23] And you too :) [01:03:09] (03CR) 10Dzahn: [C: 03+2] mediawiki:maintenance: switch translationnotifications to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/528296 (https://phabricator.wikimedia.org/T195392) (owner: 10Dzahn) [01:04:58] (03PS1) 10Dzahn: Revert "mediawiki:maintenance: switch translationnotifications to PHP 7.2" [puppet] - 10https://gerrit.wikimedia.org/r/528606 [01:06:38] (03CR) 10Dzahn: [C: 03+2] Revert "mediawiki:maintenance: switch translationnotifications to PHP 7.2" [puppet] - 10https://gerrit.wikimedia.org/r/528606 (owner: 10Dzahn) [01:06:38] Reason for revert: [01:08:27] Reedy: hrm.. "there is email involved to a lot of users and email stuff might change in new version, needs more testing to avoid surprise issue when it runs next Monday or alternatively surprise email to users when running it right now" [01:09:03] need to check for changes with DigestEmailer.php with a smaller user base or so [01:11:56] hrmm, others are bash scripts that set RUNNER=php hardcoded [01:12:07] and then call the actual MWScript.php [01:12:23] so can't just add PHP= version to them without further change [01:14:37] 10Operations, 10cloud-services-team, 10serviceops, 10Core Platform Team Legacy (Watching / External), and 4 others: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (10Dzahn) [01:17:22] 10Operations, 10cloud-services-team, 10serviceops, 10Core Platform Team Legacy (Watching / External), and 4 others: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (10Dzahn) [01:27:44] (03PS1) 10Dzahn: mediawiki:maintenance: switch pagetriage cron to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/528609 (https://phabricator.wikimedia.org/T195392) [01:27:48] 10Operations, 10cloud-services-team, 10serviceops, 10Core Platform Team Legacy (Watching / External), and 4 others: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 
(10Dzahn) [01:42:06] 10Operations, 10cloud-services-team, 10serviceops, 10Core Platform Team Legacy (Watching / External), and 4 others: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (10Dzahn) [01:55:27] 10Operations, 10cloud-services-team, 10serviceops, 10Core Platform Team Legacy (Watching / External), and 4 others: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (10Dzahn) [02:08:16] 10Operations, 10cloud-services-team, 10serviceops, 10Core Platform Team Legacy (Watching / External), and 4 others: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (10Dzahn) [02:13:59] (03PS1) 10Dzahn: mediawiki:maintenance: switch db_lag_stats to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/528612 (https://phabricator.wikimedia.org/T195392) [02:16:39] 10Operations, 10cloud-services-team, 10serviceops, 10Core Platform Team Legacy (Watching / External), and 4 others: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (10Dzahn) [02:20:00] (03CR) 10Dzahn: [C: 03+2] "tested on mwmaint1001 - works the same as before - also https://phabricator.wikimedia.org/T149210 https://gerrit.wikimedia.org/r/c/operat" [puppet] - 10https://gerrit.wikimedia.org/r/528612 (https://phabricator.wikimedia.org/T195392) (owner: 10Dzahn) [02:20:45] (03CR) 10Dzahn: "switched to PHP 7.2 in https://gerrit.wikimedia.org/r/c/operations/puppet/+/528612 for https://phabricator.wikimedia.org/T195392" [puppet] - 10https://gerrit.wikimedia.org/r/354138 (https://phabricator.wikimedia.org/T149210) (owner: 10Aaron Schulz) [02:36:06] (03PS1) 10Dzahn: mw::maintenance: switch foreachwikiindblist to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/528613 (https://phabricator.wikimedia.org/T195392) [02:36:48] (03PS2) 10Dzahn: scap/mw-maint: switch foreachwikiindblist to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/528613 
(https://phabricator.wikimedia.org/T195392) [02:39:20] 10Operations, 10SRE-Access-Requests: Requesting access to Puppet for Viztor[S] - https://phabricator.wikimedia.org/T229894 (10Dzahn) We talked on IRC about this and agreed this ticket should be re-purposed away from "production access to puppetmaster" and to "add to trusted users for CI in gerrit" and possibly... [03:06:44] (03PS1) 10Dzahn: parsoid-testing: add Hiera switch between parsoid/JS and parsoid/PHP [puppet] - 10https://gerrit.wikimedia.org/r/528615 (https://phabricator.wikimedia.org/T229363) [04:40:29] (03PS1) 10BryanDavis: toolforge: add CORS header to docker-registry [puppet] - 10https://gerrit.wikimedia.org/r/528617 [04:44:32] (03CR) 10BryanDavis: "My concrete use case for this is setting up https://github.com/Joxit/docker-registry-ui as a Toolforge tool to make registry browsing a bi" [puppet] - 10https://gerrit.wikimedia.org/r/528617 (owner: 10BryanDavis) [05:02:17] (03PS1) 10KartikMistry: Update cxserver to 2019-08-06-100812-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/528618 (https://phabricator.wikimedia.org/T227571) [05:36:05] 10Operations, 10ops-codfw, 10DBA: (2019-08-31)rack/setup/install db2131.codfw.wmnet - https://phabricator.wikimedia.org/T229251 (10Marostegui) Thanks! The alert cleared: ` ------------------------------------------------------------------------------- Record: 4 Date/Time: 08/06/2019 14:53:16 Source:... 
[05:37:29] 10Operations, 10ops-codfw, 10DBA: (2019-08-31)rack/setup/install db2131.codfw.wmnet - https://phabricator.wikimedia.org/T229251 (10Marostegui) 05Open→03Resolved [05:39:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1100 after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P8875 and previous config saved to /var/cache/conftool/dbconfig/20190807-053903-marostegui.json [05:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:45:03] (03PS1) 10Marostegui: mariadb: Decommission db1071 [puppet] - 10https://gerrit.wikimedia.org/r/528619 (https://phabricator.wikimedia.org/T229381) [05:46:39] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Remove db1071 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528620 (https://phabricator.wikimedia.org/T229381) [05:47:45] (03CR) 10Marostegui: [C: 03+2] db-eqiad,db-codfw.php: Remove db1071 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528620 (https://phabricator.wikimedia.org/T229381) (owner: 10Marostegui) [05:48:48] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db1071 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528620 (https://phabricator.wikimedia.org/T229381) (owner: 10Marostegui) [05:49:03] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db1071 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528620 (https://phabricator.wikimedia.org/T229381) (owner: 10Marostegui) [05:50:19] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db1071 from config T229381 (duration: 00m 57s) [05:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:50:27] T229381: decommission db1071.eqiad.wmnet - https://phabricator.wikimedia.org/T229381 [05:51:06] 10Operations, 10decommission, 10Patch-For-Review: decommission db1071.eqiad.wmnet - https://phabricator.wikimedia.org/T229381 (10Marostegui) [05:51:20] !log marostegui@deploy1001 
Synchronized wmf-config/db-eqiad.php: Remove db1071 from config T229381 (duration: 00m 55s) [05:51:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:52:45] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db1071 [puppet] - 10https://gerrit.wikimedia.org/r/528619 (https://phabricator.wikimedia.org/T229381) (owner: 10Marostegui) [05:53:43] 10Operations, 10decommission, 10Patch-For-Review: decommission db1071.eqiad.wmnet - https://phabricator.wikimedia.org/T229381 (10Marostegui) [05:55:20] !log Remove db1071 from tendril and zarcillo - T229381 [05:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:29] T229381: decommission db1071.eqiad.wmnet - https://phabricator.wikimedia.org/T229381 [05:57:34] !log Stop MySQL on db1071 - T229381 [05:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:28] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission db1071.eqiad.wmnet - https://phabricator.wikimedia.org/T229381 (10Marostegui) a:05Marostegui→03RobH [05:59:44] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission db1071.eqiad.wmnet - https://phabricator.wikimedia.org/T229381 (10Marostegui) This host is ready for #dc-ops to decommission [06:00:06] 10Operations, 10DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Marostegui) [06:04:33] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Add db2130 to the config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528623 (https://phabricator.wikimedia.org/T228969) [06:07:33] (03CR) 10Vgutierrez: [C: 03+1] db-eqiad,db-codfw.php: Add db2130 to the config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528623 (https://phabricator.wikimedia.org/T228969) (owner: 10Marostegui) [06:13:37] (03PS1) 10Vgutierrez: acme_chief cloud: Ensure that python3-designateclient is installed [puppet] - 10https://gerrit.wikimedia.org/r/528624 [06:13:52] (03CR) 10Elukey: 
profile::cache::kafka::alerts: move alarms to prometheus (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/526611 (https://phabricator.wikimedia.org/T229357) (owner: 10Elukey) [06:16:01] (03PS5) 10Elukey: profile::cache::kafka::alerts: move alarms to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/526611 (https://phabricator.wikimedia.org/T229357) [06:18:46] (03CR) 10Elukey: [C: 03+2] profile::cache::kafka::alerts: move alarms to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/526611 (https://phabricator.wikimedia.org/T229357) (owner: 10Elukey) [06:28:53] (03PS1) 10Elukey: profile::mediawiki::webserver: remove hhvm-restart for php-only hosts [puppet] - 10https://gerrit.wikimedia.org/r/528629 [06:30:32] (03PS2) 10Muehlenhoff: Icinga: Remove support for Jessie [puppet] - 10https://gerrit.wikimedia.org/r/528391 [06:32:21] (03CR) 10Muehlenhoff: [C: 03+2] Icinga: Remove support for Jessie [puppet] - 10https://gerrit.wikimedia.org/r/528391 (owner: 10Muehlenhoff) [06:33:09] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/17767/mw1348.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/528629 (owner: 10Elukey) [06:35:53] 10Operations, 10serviceops: tmpreaper possible race condition - https://phabricator.wikimedia.org/T151304 (10elukey) >>! In T151304#5396403, @Andrew wrote: > So that suggests that it's probably wise to keep using it on cloud VMs, as long as it still works. That said, I'm not sure that we couldn't just > /dev/... 
[06:47:27] (03PS1) 10Elukey: profile::cache::kafka::varnish_kafka_delivery_alerts: fix query [puppet] - 10https://gerrit.wikimedia.org/r/528636 (https://phabricator.wikimedia.org/T229357) [06:49:26] (03CR) 10Elukey: [C: 03+2] profile::cache::kafka::varnish_kafka_delivery_alerts: fix query [puppet] - 10https://gerrit.wikimedia.org/r/528636 (https://phabricator.wikimedia.org/T229357) (owner: 10Elukey) [06:51:19] (03PS1) 10Muehlenhoff: varnish: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/528637 [06:55:08] (03PS1) 10Elukey: profile::cache::kafka::varnishkafka_delivery_alerts: fix dashboard link [puppet] - 10https://gerrit.wikimedia.org/r/528638 (https://phabricator.wikimedia.org/T229357) [06:58:02] (03PS2) 10Elukey: profile::cache::kafka::varnishkafka_delivery_alerts: fix dashboard link [puppet] - 10https://gerrit.wikimedia.org/r/528638 (https://phabricator.wikimedia.org/T229357) [07:00:09] (03CR) 10Muehlenhoff: [C: 03+1] Add Sukhbir Singh (sukhe) to the ops group [puppet] - 10https://gerrit.wikimedia.org/r/528585 (https://phabricator.wikimedia.org/T229860) (owner: 10Ssingh) [07:02:08] (03CR) 10Elukey: [C: 03+2] profile::cache::kafka::varnishkafka_delivery_alerts: fix dashboard link [puppet] - 10https://gerrit.wikimedia.org/r/528638 (https://phabricator.wikimedia.org/T229357) (owner: 10Elukey) [07:15:27] (03PS2) 10Muehlenhoff: Remove old jessie-based pool counters [puppet] - 10https://gerrit.wikimedia.org/r/525093 (https://phabricator.wikimedia.org/T224572) [07:17:22] (03CR) 10Mathew.onipe: "PCC output is Ok: https://puppet-compiler.wmflabs.org/compiler1002/17769/" [puppet] - 10https://gerrit.wikimedia.org/r/528491 (https://phabricator.wikimedia.org/T229621) (owner: 10Mathew.onipe) [07:17:52] (03CR) 10Volans: netbox: redirect swagger doc requests to official docs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/528531 (owner: 10CRusnov) [07:20:46] (03CR) 10Marostegui: [C: 03+2] db-eqiad,db-codfw.php: Add db2130 to the config 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/528623 (https://phabricator.wikimedia.org/T228969) (owner: 10Marostegui) [07:21:55] (03PS1) 10Marostegui: db2130: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/528639 (https://phabricator.wikimedia.org/T228969) [07:22:09] (03CR) 10Muehlenhoff: [C: 03+2] Remove old jessie-based pool counters [puppet] - 10https://gerrit.wikimedia.org/r/525093 (https://phabricator.wikimedia.org/T224572) (owner: 10Muehlenhoff) [07:22:55] (03PS2) 10Marostegui: db2130: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/528639 (https://phabricator.wikimedia.org/T228969) [07:23:58] (03CR) 10Marostegui: [C: 03+2] db2130: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/528639 (https://phabricator.wikimedia.org/T228969) (owner: 10Marostegui) [07:25:59] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Add db2130 to the config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528623 (https://phabricator.wikimedia.org/T228969) (owner: 10Marostegui) [07:26:30] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Add db2130 to the config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528623 (https://phabricator.wikimedia.org/T228969) (owner: 10Marostegui) [07:27:23] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Provision db2130 into s1 T228969 (duration: 00m 55s) [07:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:33] T228969: Productionize db21[21-31} - https://phabricator.wikimedia.org/T228969 [07:28:24] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Provision db2130 into s1 T228969 (duration: 00m 56s) [07:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:10] PROBLEM - Check systemd state on db1100 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:29:22] PROBLEM - Check whether ferm is active by checking the default input chain on db1100 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [07:29:47] ^ checking that [07:30:38] RECOVERY - Check whether ferm is active by checking the default input chain on db1100 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [07:31:42] RECOVERY - Check systemd state on db1100 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:31:57] !log jmm@cumin2001 START - Cookbook sre.hosts.decommission [07:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:05] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [07:32:11] 10Operations, 10serviceops, 10Patch-For-Review: Migrate pool counters to Stretch/Buster - https://phabricator.wikimedia.org/T224572 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2001 for hosts: `poolcounter1001.eqiad.wmnet` - poolcounter1001.eqiad.wmnet - Removed from Puppet... [07:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:37] (03PS1) 10Gergő Tisza: Unset Allow-Credentials for publichtml [puppet] - 10https://gerrit.wikimedia.org/r/528645 [07:33:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1100 after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P8876 and previous config saved to /var/cache/conftool/dbconfig/20190807-073349-marostegui.json [07:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:40] (03CR) 10Gergő Tisza: "Done in I634bd84e8d." 
[puppet] - 10https://gerrit.wikimedia.org/r/522991 (https://phabricator.wikimedia.org/T224068) (owner: 10Gergő Tisza) [07:36:33] !log jmm@cumin2001 START - Cookbook sre.hosts.decommission [07:36:40] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [07:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:48] 10Operations, 10serviceops, 10Patch-For-Review: Migrate pool counters to Stretch/Buster - https://phabricator.wikimedia.org/T224572 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2001 for hosts: `poolcounter1003.eqiad.wmnet` - poolcounter1003.eqiad.wmnet - Removed from Puppet... [07:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:42] (03CR) 10Ema: [C: 03+1] "Confirmed PCC noop: https://puppet-compiler.wmflabs.org/compiler1002/17770/" [puppet] - 10https://gerrit.wikimedia.org/r/528637 (owner: 10Muehlenhoff) [07:39:02] (03PS1) 10Marostegui: mariadb: Provision db2131 into x1 [puppet] - 10https://gerrit.wikimedia.org/r/528668 (https://phabricator.wikimedia.org/T228969) [07:41:33] 10Operations, 10Wikimedia-Mailing-lists: New Mailing lists for AzWiki sysops - https://phabricator.wikimedia.org/T228542 (10Eldarado) @Dzahn Excuse me for my false, I updated the information below. Thanks in advance. * requested name of the mailing list, ending in @lists.wikimedia.org. Wikimedia-AZ@lists.wik... [07:41:52] (03CR) 10Marostegui: [C: 03+2] mariadb: Provision db2131 into x1 [puppet] - 10https://gerrit.wikimedia.org/r/528668 (https://phabricator.wikimedia.org/T228969) (owner: 10Marostegui) [07:42:40] PROBLEM - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [07:43:07] (03PS2) 10Ema: ATS: use TLS to connect to analytics hosts [puppet] - 10https://gerrit.wikimedia.org/r/524482 (https://phabricator.wikimedia.org/T210411) [07:44:35] (03CR) 10Ema: [C: 03+2] ATS: use TLS to connect to analytics hosts [puppet] - 10https://gerrit.wikimedia.org/r/524482 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [07:48:41] 10Operations: install2002 short on disk space - https://phabricator.wikimedia.org/T229997 (10Marostegui) p:05Triage→03Normal [07:53:48] (03CR) 10Elukey: "Left a comment but the httpd rewriterules looks good in my opinion (modulo the removed proxypass URIs, I assume that those are ok, don't h" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/528521 (https://phabricator.wikimedia.org/T228657) (owner: 10Jbond) [07:55:03] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, and 2 others: TLS certificates for Analytics origin servers - https://phabricator.wikimedia.org/T227860 (10ema) 05Open→03Resolved Thank you so much @elukey! ATS is now using TLS only for connections to #analytics origins. 
[07:56:27] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10ema) [07:58:58] (03PS1) 10Ema: ATS: use TLS to connect to matomo [puppet] - 10https://gerrit.wikimedia.org/r/528704 (https://phabricator.wikimedia.org/T210411) [08:00:43] (03CR) 10Elukey: [C: 03+1] ATS: use TLS to connect to matomo [puppet] - 10https://gerrit.wikimedia.org/r/528704 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [08:01:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db2130 into s1 - T228969', diff saved to https://phabricator.wikimedia.org/P8877 and previous config saved to /var/cache/conftool/dbconfig/20190807-080059-marostegui.json [08:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:08] T228969: Productionize db21[21-31} - https://phabricator.wikimedia.org/T228969 [08:01:09] (03CR) 10Ema: [C: 03+2] ATS: use TLS to connect to matomo [puppet] - 10https://gerrit.wikimedia.org/r/528704 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [08:02:06] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "see the comment inline" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528591 (https://phabricator.wikimedia.org/T229354) (owner: 10Subramanya Sastry) [08:05:57] 10Operations, 10observability: Remove logster from cp* hosts - https://phabricator.wikimedia.org/T229357 (10elukey) From my point of view logster etc.. on the cp hosts can be removed! [08:09:59] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10ema) [08:12:42] (03PS7) 10Effie Mouzeli: profile::mediawiki::hhvm: default php to php7 on stretch [puppet] - 10https://gerrit.wikimedia.org/r/425027 (https://phabricator.wikimedia.org/T195392) (owner: 10Giuseppe Lavagetto) [08:14:00] PROBLEM - puppet last run on poolcounter2002 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/10-puppet-agent.conf] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:14:10] (03CR) 10jerkins-bot: [V: 04-1] profile::mediawiki::hhvm: default php to php7 on stretch [puppet] - 10https://gerrit.wikimedia.org/r/425027 (https://phabricator.wikimedia.org/T195392) (owner: 10Giuseppe Lavagetto) [08:18:00] 10Operations: decom cookbook: dry-run mode not working / PuppetDB removal failed - https://phabricator.wikimedia.org/T229998 (10MoritzMuehlenhoff) [08:18:07] (03PS2) 10Filippo Giunchedi: Unset Allow-Credentials for publichtml [puppet] - 10https://gerrit.wikimedia.org/r/528645 (owner: 10Gergő Tisza) [08:21:55] (03CR) 10Filippo Giunchedi: [C: 03+2] Unset Allow-Credentials for publichtml [puppet] - 10https://gerrit.wikimedia.org/r/528645 (owner: 10Gergő Tisza) [08:24:50] RECOVERY - puppet last run on poolcounter2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:27:53] (03CR) 10Marostegui: [C: 03+1] ">" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525147 (owner: 10Aaron Schulz) [08:35:11] 10Operations: decom cookbook: dry-run mode not working / PuppetDB removal failed - https://phabricator.wikimedia.org/T229998 (10MoritzMuehlenhoff) After running the deactivate step a second time, poolcounter1003 got correctly removed. Looking at PuppetDB logs there might be some kind of race in PuppetDB: ` jmm@... 
[08:37:10] !log jmm@cumin2001 START - Cookbook sre.hosts.decommission [08:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:17] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [08:37:22] 10Operations, 10serviceops: Migrate pool counters to Stretch/Buster - https://phabricator.wikimedia.org/T224572 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2001 for hosts: `poolcounter2001.codfw.wmnet` - poolcounter2001.codfw.wmnet - Removed from Puppet master and PuppetDB... [08:37:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:22] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Pool db2131 into x1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528713 (https://phabricator.wikimedia.org/T228969) [08:42:03] (03PS1) 10Ema: ATS: use TLS for thorium, dbmonitor, netmon [puppet] - 10https://gerrit.wikimedia.org/r/528715 (https://phabricator.wikimedia.org/T210411) [08:42:24] (03PS1) 10Vgutierrez: Backport prefetched OCSP stapling responses [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/528716 (https://phabricator.wikimedia.org/T220383) [08:44:06] 10Operations: decom cookbook: dry-run mode not working / PuppetDB removal failed - https://phabricator.wikimedia.org/T229998 (10Volans) @MoritzMuehlenhoff the dry-run mode is passed to all modules and all modules must implement unless they only do RO actions. The difference in logging is because dry-run sets aut... 
[08:45:03] (03CR) 10Volans: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528713 (https://phabricator.wikimedia.org/T228969) (owner: 10Marostegui) [08:45:22] (03CR) 10Vgutierrez: [C: 03+1] db-eqiad,db-codfw.php: Pool db2131 into x1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528713 (https://phabricator.wikimedia.org/T228969) (owner: 10Marostegui) [08:45:43] (03CR) 10Marostegui: [C: 03+2] db-eqiad,db-codfw.php: Pool db2131 into x1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528713 (https://phabricator.wikimedia.org/T228969) (owner: 10Marostegui) [08:46:17] (03CR) 10jerkins-bot: [V: 04-1] Backport prefetched OCSP stapling responses [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/528716 (https://phabricator.wikimedia.org/T220383) (owner: 10Vgutierrez) [08:46:23] of course [08:46:24] :_) [08:46:42] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Pool db2131 into x1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528713 (https://phabricator.wikimedia.org/T228969) (owner: 10Marostegui) [08:46:57] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Pool db2131 into x1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528713 (https://phabricator.wikimedia.org/T228969) (owner: 10Marostegui) [08:47:14] (03CR) 10Ema: [C: 03+2] ATS: use TLS for thorium, dbmonitor, netmon [puppet] - 10https://gerrit.wikimedia.org/r/528715 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [08:48:17] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10ema) [08:48:25] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Provision db2131 into x1 T228969 (duration: 00m 56s) [08:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:33] T228969: Productionize db21[21-31} - https://phabricator.wikimedia.org/T228969 [08:48:45] (03PS2) 10Vgutierrez: Backport prefetched OCSP stapling responses 
[debs/trafficserver] - 10https://gerrit.wikimedia.org/r/528716 (https://phabricator.wikimedia.org/T220383) [08:49:27] (03CR) 10Gehel: [C: 04-1] Add maps reboot cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: 10Mathew.onipe) [08:49:33] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Provision db2131 into x1 T228969 (duration: 00m 55s) [08:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:48] (03PS8) 10Effie Mouzeli: profile::mediawiki::hhvm: default php to php7 on stretch [puppet] - 10https://gerrit.wikimedia.org/r/425027 (https://phabricator.wikimedia.org/T195392) (owner: 10Giuseppe Lavagetto) [08:51:12] (03CR) 10jerkins-bot: [V: 04-1] profile::mediawiki::hhvm: default php to php7 on stretch [puppet] - 10https://gerrit.wikimedia.org/r/425027 (https://phabricator.wikimedia.org/T195392) (owner: 10Giuseppe Lavagetto) [08:53:53] (03CR) 10jerkins-bot: [V: 04-1] Backport prefetched OCSP stapling responses [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/528716 (https://phabricator.wikimedia.org/T220383) (owner: 10Vgutierrez) [08:54:33] 10Operations: decom cookbook: dry-run mode not working / PuppetDB removal failed - https://phabricator.wikimedia.org/T229998 (10MoritzMuehlenhoff) There's more: Next I ran the cook book for a host for which the dry-run mode had not been used on previously (to rule out that the incomplete dry-run skews the effect... 
[08:56:47] (03PS3) 10Vgutierrez: Backport prefetched OCSP stapling responses [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/528716 (https://phabricator.wikimedia.org/T220383) [08:56:48] moritzm: the puppetdb stuff is the queue, see my link to grafana [08:56:58] it needs investigation, but I really can't right now [08:58:07] the debmonitor stuff is the usual, the host is not (yet) powerdown and the puppet crontab runs every 30m, running apt-get update and sending the upgradable packages to debmonitor [08:58:30] it will all be fixed with the action item from the SRE summit to add the dd+shutdown to this cookbook [08:58:34] just ENOTIME to do it so far [08:58:36] PROBLEM - Apache HTTP on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:58:52] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: site=eqsin https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [08:59:11] (03PS1) 10Marostegui: db2131: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/528717 (https://phabricator.wikimedia.org/T228969) [08:59:44] RECOVERY - Apache HTTP on mw1290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:59:51] (03PS2) 10Marostegui: db2131: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/528717 (https://phabricator.wikimedia.org/T228969) [08:59:52] mutante: the iframe with puppetboard requires login, not sure if we should keep it in the grafana dashboard [09:00:22] 10Operations, 10Analytics, 10hardware-requests, 10User-Elukey: eqiad: 1 misc node for the Kerberos KDC service - https://phabricator.wikimedia.org/T227288 (10elukey) Looks good to me (followed up only on the codfw task). Can we get them repurposed? 
[09:01:42] (03CR) 10Marostegui: [C: 03+2] db2131: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/528717 (https://phabricator.wikimedia.org/T228969) (owner: 10Marostegui) [09:06:30] (03PS1) 10Filippo Giunchedi: base: don't CRITICAL on per-host puppet failures [puppet] - 10https://gerrit.wikimedia.org/r/528719 (https://phabricator.wikimedia.org/T229262) [09:06:34] (03PS1) 10Arturo Borrero Gonzalez: keystone: allow svc account traffic-cloud-dns-manager to use password auth [puppet] - 10https://gerrit.wikimedia.org/r/528720 (https://phabricator.wikimedia.org/T229786) [09:08:02] (03PS1) 10Ema: Add discovery CNAME webserver-misc-static -> bromine [dns] - 10https://gerrit.wikimedia.org/r/528721 (https://phabricator.wikimedia.org/T210411) [09:08:42] gerrit in guru meditation for anyone else ? [09:09:23] godog: same here, very slow [09:09:57] funny enough.. is working nice for me right now [09:10:21] even with my ~300ms roundtrip to gerrit.wm.i [09:10:42] 10Operations: decom cookbook: dry-run mode not working / PuppetDB removal failed - https://phabricator.wikimedia.org/T229998 (10MoritzMuehlenhoff) The removal in Debmonitor has a similar race to the PuppetDB removal: I seem to be really lucky, hitting two different races in two subsequent decom runs :-) ` jmm@... [09:10:56] 10Operations: decom cookbook: dry-run mode not working / PuppetDB and Debmonitor removals can fail - https://phabricator.wikimedia.org/T229998 (10MoritzMuehlenhoff) [09:12:14] moritzm: see above, they don't fail [09:12:29] sorry, in the middle of something else, I'll reply to the task asap [09:13:57] arturo: back for me [09:14:12] godog: yeah, kind of [09:14:19] (03CR) 10Filippo Giunchedi: "One of multiple approaches to achieve the same effect, let me know what you think!" 
[puppet] - 10https://gerrit.wikimedia.org/r/528719 (https://phabricator.wikimedia.org/T229262) (owner: 10Filippo Giunchedi) [09:14:32] volans: yeah, they fail, but get re-added due to races, no hurry at all, just wanted to open a task to write down my findings [09:14:33] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] keystone: allow svc account traffic-cloud-dns-manager to use password auth [puppet] - 10https://gerrit.wikimedia.org/r/528720 (https://phabricator.wikimedia.org/T229786) (owner: 10Arturo Borrero Gonzalez) [09:14:47] arturo: my +1 is still on the way [09:14:48] :_( [09:14:53] !log Drop math table from s6 - T196055 [09:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:01] T196055: Remove table `math` from the database - https://phabricator.wikimedia.org/T196055 [09:15:03] already merged vgutierrez [09:15:03] the UI is frozen checking if PS1 is the latest PS or not [09:15:05] what a mess [09:15:18] vgutierrez: that's because you're in the future :-P [09:15:19] * volans hides [09:15:26] moritzm: ack [09:15:36] damn.. I just wrote /exec -o date in my irccloud tab [09:15:43] of course I didn't get the expected output [09:15:55] Wed Aug 7 16:15:51 +07 2019 [09:15:59] yeah.. I'm in the future [09:16:30] BTW, the timezone string for the Indochina timezone is pretty ugly [09:16:34] "+07" [09:17:54] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [09:19:25] (03CR) 10Mathew.onipe: Add maps reboot cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: 10Mathew.onipe) [09:20:36] (03PS1) 10Marostegui: maintain-views.yaml: Remove math table [puppet] - 10https://gerrit.wikimedia.org/r/528724 (https://phabricator.wikimedia.org/T196055) [09:22:46] 10Operations: decom cookbook: dry-run mode not working / PuppetDB and Debmonitor removals can fail - https://phabricator.wikimedia.org/T229998 (10Volans) The solution that was agreed at the SRE summit for this is to add a `dd` to override the bootloader(s) so that the host cannot boot anymore and perform a shutd... [09:23:25] (03PS1) 10Ema: webserver-misc-static: add certificate [puppet] - 10https://gerrit.wikimedia.org/r/528725 (https://phabricator.wikimedia.org/T210411) [09:23:40] (03CR) 10Marostegui: "bstorm this table will be deleted in production, I guess all I need to remove it from the views is to delete from here and then run mainta" [puppet] - 10https://gerrit.wikimedia.org/r/528724 (https://phabricator.wikimedia.org/T196055) (owner: 10Marostegui) [09:31:24] 10Operations: puppetdb queue size went up since July 30 - https://phabricator.wikimedia.org/T230002 (10MoritzMuehlenhoff) [09:32:15] (03PS1) 10Ema: secret: dummy key for webserver-misc-static [labs/private] - 10https://gerrit.wikimedia.org/r/528726 (https://phabricator.wikimedia.org/T210411) [09:32:59] (03CR) 10Volans: "I'm sorry, I'm a bit behind on things and I'll be out for the rest of the week. 
For I'll not be able to have a second pass before next wee" [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) (owner: 10Holger Knust) [09:34:12] (03CR) 10Ema: [C: 03+2] webserver-misc-static: add certificate [puppet] - 10https://gerrit.wikimedia.org/r/528725 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [09:34:51] (03CR) 10Ema: [V: 03+2 C: 03+2] secret: dummy key for webserver-misc-static [labs/private] - 10https://gerrit.wikimedia.org/r/528726 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [09:41:47] (03CR) 10Filippo Giunchedi: "Personal preference / nit, I find unified diffs more readable" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/528586 (owner: 10Cwhite) [09:46:38] (03PS2) 10Elukey: profile::mediawiki::webserver: remove hhvm-restart for php-only hosts [puppet] - 10https://gerrit.wikimedia.org/r/528629 [09:48:41] (03CR) 10Giuseppe Lavagetto: [C: 03+1] profile::mediawiki::webserver: remove hhvm-restart for php-only hosts [puppet] - 10https://gerrit.wikimedia.org/r/528629 (owner: 10Elukey) [09:50:48] (03CR) 10Elukey: [C: 03+2] profile::mediawiki::webserver: remove hhvm-restart for php-only hosts [puppet] - 10https://gerrit.wikimedia.org/r/528629 (owner: 10Elukey) [09:51:07] 10Operations: puppetdb queue size went up since July 30 - https://phabricator.wikimedia.org/T230002 (10jbond) Seems to correlate well with when i [[ https://github.com/wikimedia/puppet/commit/96fe4d5fd633f7322313a459d04d1609247a722b | enabled the canary puppet master ]] and when [[https://github.com/wikimedia/pu... [09:55:01] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [10:00:22] vgutierrez, out of curiosity, what makes it ugly? (the whole timestamp seems unusually formatted to me) [10:00:59] so back in Spain I'd get a proper string for the timezone... 
CET/CEST [10:03:08] !log jmm@cumin2001 START - Cookbook sre.hosts.decommission [10:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:15] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [10:03:19] 10Operations, 10serviceops: Migrate pool counters to Stretch/Buster - https://phabricator.wikimedia.org/T224572 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2001 for hosts: `poolcounter2002.codfw.wmnet` - poolcounter2002.codfw.wmnet - Removed from Puppet master and PuppetDB... [10:03:21] ah. that. expressing timezone as offset from UTC is generally preferred, as it's unambiguous as to what the tz is. those timezone names can be hard to understand for those not in the tz, are not necessarily unique world-wide, and the offset they indicate changes at the whim of governments [10:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:30] 10Operations, 10netops: BGP session down for AS 20485 on cr2-esams - https://phabricator.wikimedia.org/T230004 (10elukey) p:05Triage→03Normal [10:04:41] 10Operations, 10Wiki-Setup (Delete / Redirect): Merge or delete grantswiki - https://phabricator.wikimedia.org/T229950 (10MarcoAurelio) [10:07:33] (03PS1) 10MarcoAurelio: [WIP] mediawiki:maintenance::purge_checkuser.pp: Switch to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/528730 (https://phabricator.wikimedia.org/T195392) [10:09:38] (03PS2) 10MarcoAurelio: [WIP] mediawiki:maintenance::purge_checkuser.pp: Switch to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/528730 (https://phabricator.wikimedia.org/T195392) [10:10:41] (03PS3) 10MarcoAurelio: [WIP] mediawiki:maintenance::purge_checkuser.pp: Switch to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/528730 (https://phabricator.wikimedia.org/T195392) [10:10:48] (03PS1) 10Filippo Giunchedi: prometheus: start collecting mediawiki aggregated stats [puppet] - 10https://gerrit.wikimedia.org/r/528733 
(https://phabricator.wikimedia.org/T228878) [10:11:01] !log deleting poolcounter1001, poolcounter1003, poolcounter2001, poolcounter2002 in Ganeti (T224572) [10:11:02] (03CR) 10MarcoAurelio: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/528730 (https://phabricator.wikimedia.org/T195392) (owner: 10MarcoAurelio) [10:11:12] (03PS1) 10Marostegui: mariadb: Productionize dbproxy2003 into m3-codfw [puppet] - 10https://gerrit.wikimedia.org/r/528734 (https://phabricator.wikimedia.org/T202367) [10:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:13] T224572: Migrate pool counters to Stretch/Buster - https://phabricator.wikimedia.org/T224572 [10:11:29] 10Operations: puppetdb queue size went up since July 30 - https://phabricator.wikimedia.org/T230002 (10jbond) It seems when that the severs using the new puppet master cause the following stack trace when they try reach the 'store report' phase ` 2019-08-07 10:04:55,104 ERROR [p.p.threadpool] Error processing c... [10:12:47] (03CR) 10MarcoAurelio: "Perhaps we should enable logging of the runs again so we can test this script works as expected on PHP 7? 
This script needs to be running " [puppet] - 10https://gerrit.wikimedia.org/r/528730 (https://phabricator.wikimedia.org/T195392) (owner: 10MarcoAurelio) [10:15:41] (03CR) 10MarcoAurelio: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/251/" [puppet] - 10https://gerrit.wikimedia.org/r/528730 (https://phabricator.wikimedia.org/T195392) (owner: 10MarcoAurelio) [10:16:48] 10Operations, 10netops: BGP session down for AS4739 on cr4-ulsfo - https://phabricator.wikimedia.org/T230005 (10elukey) p:05Triage→03Normal [10:18:24] (03PS4) 10MarcoAurelio: mediawiki::maintenance::purge_checkuser.pp: Switch to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/528730 (https://phabricator.wikimedia.org/T195392) [10:19:11] (03PS2) 10Marostegui: mariadb: Productionize dbproxy2003 into m3-codfw [puppet] - 10https://gerrit.wikimedia.org/r/528734 (https://phabricator.wikimedia.org/T202367) [10:23:25] (03PS4) 10Vgutierrez: Backport prefetched OCSP stapling responses [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/528716 (https://phabricator.wikimedia.org/T220383) [10:24:00] 10Puppet, 10Patch-For-Review: Upgrade Puppet Masters and Puppet DB servers - https://phabricator.wikimedia.org/T228657 (10jbond) p:05Triage→03Normal [10:24:52] 10Operations: puppetdb queue size went up since July 30 - https://phabricator.wikimedia.org/T230002 (10jbond) [10:24:55] 10Puppet, 10Patch-For-Review: Upgrade Puppet Masters and Puppet DB servers - https://phabricator.wikimedia.org/T228657 (10jbond) [10:25:11] (03PS1) 10Jbond: puppetmaster1003: offline this puppetmaster as its scheme is incompatible [puppet] - 10https://gerrit.wikimedia.org/r/528744 (https://phabricator.wikimedia.org/T228657) [10:25:47] (03PS2) 10Jbond: puppetmaster1003: offline this puppetmaster as its scheme is incompatible [puppet] - 10https://gerrit.wikimedia.org/r/528744 (https://phabricator.wikimedia.org/T228657) [10:27:02] (03CR) 10Jbond: [C: 03+2] puppetmaster1003: offline this puppetmaster as its scheme is incompatible 
[puppet] - 10https://gerrit.wikimedia.org/r/528744 (https://phabricator.wikimedia.org/T228657) (owner: 10Jbond) [10:27:12] (03PS3) 10Marostegui: mariadb: Productionize dbproxy2003 into m3-codfw [puppet] - 10https://gerrit.wikimedia.org/r/528734 (https://phabricator.wikimedia.org/T202367) [10:32:25] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [10:35:57] 10Operations, 10Patch-For-Review: puppetdb queue size went up since July 30 - https://phabricator.wikimedia.org/T230002 (10jbond) I have disabled puppetmaster1003 for now, unfortunately from reading the [[https://tickets.puppetlabs.com/browse/PUP-8901 | PUP-8901]] It seems the advice from puppetlabs is to alwa... [10:37:22] (03CR) 10jerkins-bot: [V: 04-1] Backport prefetched OCSP stapling responses [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/528716 (https://phabricator.wikimedia.org/T220383) (owner: 10Vgutierrez) [10:47:53] (03PS3) 10Jbond: puppetmaster::frontend: update web conf to use RewriteRules instead of proxypass [puppet] - 10https://gerrit.wikimedia.org/r/528521 (https://phabricator.wikimedia.org/T228657) [10:47:58] (03PS1) 10Muehlenhoff: Remove DNS entries for poolcounter100[13] and poolcounter[12] [dns] - 10https://gerrit.wikimedia.org/r/528747 (https://phabricator.wikimedia.org/T224572) [10:48:43] (03CR) 10Jbond: puppetmaster::frontend: update web conf to use RewriteRules instead of proxypass (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/528521 (https://phabricator.wikimedia.org/T228657) (owner: 10Jbond) [10:49:12] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster::frontend: update web conf to use RewriteRules instead of proxypass [puppet] - 10https://gerrit.wikimedia.org/r/528521 (https://phabricator.wikimedia.org/T228657) (owner: 10Jbond) 
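Note on the timezone exchange earlier in the log (the `+07` in the `date` output, and the point that an offset from UTC is unambiguous while zone abbreviations are not): a minimal Python sketch of that idea, using only the standard library. The fixed +07:00 offset is read off the quoted `date` output; the helper name `describe` and the small `AMBIGUOUS_ABBREVIATIONS` table are illustrative assumptions, not something taken from the channel.

```python
from datetime import datetime, timedelta, timezone

# Instant quoted in the log: "Wed Aug 7 16:15:51 +07 2019".
# Assumption: "+07" means a fixed offset of seven hours ahead of UTC.
PLUS_SEVEN = timezone(timedelta(hours=7))
local = datetime(2019, 8, 7, 16, 15, 51, tzinfo=PLUS_SEVEN)


def describe(moment):
    """Return the ISO 8601 form with its numeric offset, and the same instant in UTC."""
    return moment.isoformat(), moment.astimezone(timezone.utc).isoformat()


iso_local, iso_utc = describe(local)
print(iso_local)
print(iso_utc)

# Illustrative only: one abbreviation can stand for several different offsets,
# which is why a bare name is harder to interpret than a numeric offset.
AMBIGUOUS_ABBREVIATIONS = {
    "CST": ["-06:00 (US Central Standard)", "+08:00 (China Standard)", "-05:00 (Cuba Standard)"],
    "IST": ["+05:30 (India)", "+02:00 (Israel)", "+01:00 (Irish)"],
}
for name, meanings in AMBIGUOUS_ABBREVIATIONS.items():
    print(name, "->", ", ".join(meanings))
```

The numeric form converts to UTC with no lookup at all, which matches the channel timestamps around that message; an abbreviation needs extra context before it can be converted.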
[10:49:41] (03PS5) 10Vgutierrez: Backport prefetched OCSP stapling responses [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/528716 (https://phabricator.wikimedia.org/T220383) [10:52:11] (03PS4) 10Jbond: puppetmaster::frontend: update web conf to use RewriteRules instead of proxypass [puppet] - 10https://gerrit.wikimedia.org/r/528521 (https://phabricator.wikimedia.org/T228657) [10:52:50] (03CR) 10jerkins-bot: [V: 04-1] Backport prefetched OCSP stapling responses [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/528716 (https://phabricator.wikimedia.org/T220383) (owner: 10Vgutierrez) [10:55:47] (03PS6) 10Vgutierrez: Backport prefetched OCSP stapling responses [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/528716 (https://phabricator.wikimedia.org/T220383) [10:59:11] (03PS2) 10Pmiazga: Enable AMC on all wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528458 (https://phabricator.wikimedia.org/T228916) [11:00:04] Amir1, Lucas_WMDE, and Urbanecm: (Dis)respected human, time to deploy European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190807T1100). Please do the needful. [11:00:05] raynor: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:14] o/ [11:00:23] o/ I might have a patch to deploy [11:00:28] I'm around, looks like my patch is the only one [11:00:28] o/ [11:00:47] raynor: can you deploy your own patch? [11:00:49] I can deploy mine [11:00:51] on it [11:01:37] merging [11:02:21] is gerrit down for anyone? gerrit.wikimedia.org took too long to respond. [11:02:27] yes, same here [11:02:39] same here as well [11:03:45] isn't icinga monitoring gerrit? 
I would have expected an alert by now [11:04:08] I was still able to look at raynor’s patch 2 minutes ago [11:04:13] but now it’s hanging, yeah [11:04:37] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [11:04:48] here we go [11:05:13] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [11:05:41] it seemed to me that patches are visible but anything "searchy" timed out [11:05:55] SSH access also still seems to work [11:06:10] colleague sitting next to me was just able to push a new patch, apparently [11:07:46] I am checking gerrit logs on cobalt [11:07:51] But I am tempted to give it a restart [11:09:38] !log Restart gerrit [11:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:48] gerrit is back up [11:12:02] (03CR) 10Pmiazga: [C: 03+2] Enable AMC on all wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528458 (https://phabricator.wikimedia.org/T228916) (owner: 10Pmiazga) [11:12:07] thank you marostegui [11:12:23] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 865 bytes in 0.076 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [11:13:01] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 26087 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [11:13:35] raynor: yw! [11:13:57] https://gerrit.wikimedia.org/r/monitoring?part=graph&graph=activeThreads [11:13:58] The thread problem happened again [11:14:15] that restart could mean we'll lose some CI jobs?
[11:14:29] I'm asking cause my 11 minutes CI job is not showing anywhere :) [11:14:40] (building ATS is amazing) [11:14:50] (03CR) 10Santhosh: [C: 03+2] Update cxserver to 2019-08-06-100812-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/528618 (https://phabricator.wikimedia.org/T227571) (owner: 10KartikMistry) [11:15:01] PROBLEM - puppet last run on webperf2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_performance/docroot] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:15:17] paladox: is that a known problem? I knew there was a memory issue with gerrit from time to time on cobalt, is that the problem? [11:15:21] PROBLEM - puppet last run on webperf1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhgui] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:15:25] PROBLEM - puppet last run on schema2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:15:39] PROBLEM - puppet last run on netmon2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_operations/software/netbox-reports] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:15:52] vgutierrez: it shouldn't I think, CI has been slow the whole morning today https://grafana.wikimedia.org/d/000000322/zuul-gearman?from=now-24h&to=now&orgId=1 [11:16:02] (03CR) 10Santhosh: [V: 03+2 C: 03+2] Update cxserver to 2019-08-06-100812-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/528618 (https://phabricator.wikimedia.org/T227571) (owner: 10KartikMistry) [11:16:05] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:16:09] basically what I've lost is the comment back on the gerrit CR [11:16:21] PROBLEM - puppet last run on netmon1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/netbox-reports] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:16:21] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [11:16:21] the build is happy here https://integration.wikimedia.org/ci/job/debian-glue/1530/ [11:16:26] (03Merged) 10jenkins-bot: Enable AMC on all wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528458 (https://phabricator.wikimedia.org/T228916) (owner: 10Pmiazga) [11:16:31] vgutierrez: maybe it was done while gerrit was sort of down but not fully? 
[11:16:37] Amir1, FYI: I'm waiting for merge [11:16:39] no idea about its internals really, I am just guessing :) [11:16:41] (03CR) 10jenkins-bot: Enable AMC on all wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528458 (https://phabricator.wikimedia.org/T228916) (owner: 10Pmiazga) [11:16:43] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_wikistats-v2],Exec[git_pull_analytics.wikimedia.org] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:16:44] marostegui: probably :) [11:17:05] raynor: cool, let me know when you're done [11:17:26] merged, deploying to mwdebug1002 and testing, We will need like 10 mins I think [11:17:45] PROBLEM - puppet last run on notebook1004 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:18:33] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 6 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_statistics_mediawiki] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:18:33] PROBLEM - puppet last run on releases1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI Composer] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:18:49] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 7 minutes ago with 7 failures. 
Failed resources (up to 3 shown): Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikibase/wikiba.se-deploy],Exec[git_pull_research/landing-page] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:19:15] PROBLEM - puppet last run on cumin2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:19:30] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, thanks for working on this!" [puppet] - 10https://gerrit.wikimedia.org/r/528581 (https://phabricator.wikimedia.org/T229884) (owner: 10Bstorm) [11:19:36] Marostegui: yeh, it’s known [11:19:43] (03CR) 10Muehlenhoff: [C: 03+2] Remove DNS entries for poolcounter100[13] and poolcounter[12] [dns] - 10https://gerrit.wikimedia.org/r/528747 (https://phabricator.wikimedia.org/T224572) (owner: 10Muehlenhoff) [11:19:45] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 8 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:20:00] paladox: any task where you guys are keeping track of this so it can be reported that it happened today too? [11:21:57] I think there’s one [11:21:58] * paladox searches [11:22:03] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' . [11:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:07] Marostegui: [11:23:08] https://phabricator.wikimedia.org/T224448 [11:23:14] 10Operations, 10serviceops, 10Patch-For-Review: Migrate pool counters to Buster - https://phabricator.wikimedia.org/T224572 (10MoritzMuehlenhoff) [11:23:21] paladox: thanks! 
[11:23:23] I will comment [11:24:58] 10Operations, 10serviceops, 10Patch-For-Review: Migrate pool counters to Buster - https://phabricator.wikimedia.org/T224572 (10MoritzMuehlenhoff) 05Open→03Resolved We now have the main pool counters running on Buster using the stock Debian package of poolcounter (poolcounter1004, poolcounter1005, poolcou... [11:25:04] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [11:26:13] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' . [11:26:17] Amir1 - my patch is good, syncing to prod [11:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:27] you said you want to deploy sth [11:26:40] yeah [11:26:41] thanks! [11:26:57] !log pmiazga@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:528458|Enable AMC on all wikipedias (T228916)]] (duration: 00m 55s) [11:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:06] T228916: Deploy AMC to all Wikipedias - https://phabricator.wikimedia.org/T228916 [11:27:31] PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:27:43] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [11:28:10] Amir1, once you're done, please close the SWAT window [11:28:17] Sure [11:28:45] (03PS2) 10Ladsgroup: Switch property terms migration to WRITE_NEW on client wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527087 (https://phabricator.wikimedia.org/T225053) [11:28:51] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates 
https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [11:28:52] (03CR) 10Ladsgroup: [C: 03+2] Switch property terms migration to WRITE_NEW on client wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527087 (https://phabricator.wikimedia.org/T225053) (owner: 10Ladsgroup) [11:28:59] 10Operations, 10SRE-Access-Requests, 10cloud-services-team (Kanban): SRE: root access for Hieu Pham, SRE @ WMCS - https://phabricator.wikimedia.org/T229833 (10aborrero) 05Open→03Resolved This should be done. Anybody please reopen if there are any related issues. [11:29:15] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' . [11:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:07] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [11:31:11] (03Merged) 10jenkins-bot: Switch property terms migration to WRITE_NEW on client wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527087 (https://phabricator.wikimedia.org/T225053) (owner: 10Ladsgroup) [11:31:26] (03CR) 10jenkins-bot: Switch property terms migration to WRITE_NEW on client wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527087 (https://phabricator.wikimedia.org/T225053) (owner: 10Ladsgroup) [11:32:54] marostegui: syncing [11:33:07] go! 
[11:33:39] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:527087|Switch property terms migration to WRITE_NEW on client wikis (T225053)]] (duration: 00m 56s) [11:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:47] T225053: Switch `tmpPropertyTermsMigrationStage` to MIGRATION_WRITE_NEW - https://phabricator.wikimedia.org/T225053 [11:35:19] RECOVERY - puppet last run on releases1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:36:33] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:37:19] !log Updated cxserver to 2019-08-06-100812-production (T227571) [11:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:28] T227571: Create cxserver api to suggest source title for given target language and title - https://phabricator.wikimedia.org/T227571 [11:38:39] RECOVERY - puppet last run on netmon1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:39:05] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:40:03] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:40:12] Amir1: everything seems stable [11:40:18] I am monitoring the master and db1109 [11:40:30] Yeah, I keep monitoring things [11:40:57] this also looks stable: 
https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s8&var-role=All [11:41:09] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:41:10] no pattern changes [11:42:59] RECOVERY - puppet last run on webperf2001 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:43:19] RECOVERY - puppet last run on webperf1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:43:23] RECOVERY - puppet last run on schema2001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:43:41] RECOVERY - puppet last run on netmon2001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:43:56] Things are going into cache now [11:44:03] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:44:33] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [11:44:55] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [11:44:55] (03PS1) 
10Thcipriani: gerrit: replication: escape project slashes [puppet] - 10https://gerrit.wikimedia.org/r/528769 (https://phabricator.wikimedia.org/T229945) [11:46:31] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:46:38] (03CR) 10Paladox: gerrit: replication: escape project slashes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/528769 (https://phabricator.wikimedia.org/T229945) (owner: 10Thcipriani) [11:47:13] RECOVERY - puppet last run on cumin2001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:48:30] !log EU SWAT is done [11:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:21] (03PS1) 10Fdans: Merge branch 'production' of https://gerrit.wikimedia.org/r/operations/puppet into HEAD [puppet] - 10https://gerrit.wikimedia.org/r/528770 [11:50:23] (03PS1) 10Fdans: role::common::aqs: update druid mediawiki's datasource [puppet] - 10https://gerrit.wikimedia.org/r/528771 [11:51:51] (03Abandoned) 10Fdans: Merge branch 'production' of https://gerrit.wikimedia.org/r/operations/puppet into HEAD [puppet] - 10https://gerrit.wikimedia.org/r/528770 (owner: 10Fdans) [11:55:10] (03PS2) 10Thcipriani: gerrit: replication: escape project slashes [puppet] - 10https://gerrit.wikimedia.org/r/528769 (https://phabricator.wikimedia.org/T229945) [11:55:17] (03PS7) 10Vgutierrez: Backport required OCSP commits [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/528716 (https://phabricator.wikimedia.org/T220383) [11:55:42] (03CR) 10jerkins-bot: [V: 04-1] Backport required OCSP commits [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/528716 (https://phabricator.wikimedia.org/T220383) (owner: 10Vgutierrez) [11:55:51] that was fast... 
[11:56:00] (03CR) 10Thcipriani: gerrit: replication: escape project slashes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/528769 (https://phabricator.wikimedia.org/T229945) (owner: 10Thcipriani) [11:57:09] (03PS8) 10Vgutierrez: Backport required OCSP commits [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/528716 (https://phabricator.wikimedia.org/T220383) [11:58:44] (03CR) 10Paladox: [C: 03+1] gerrit: replication: escape project slashes [puppet] - 10https://gerrit.wikimedia.org/r/528769 (https://phabricator.wikimedia.org/T229945) (owner: 10Thcipriani) [12:00:50] (03CR) 10jerkins-bot: [V: 04-1] Backport required OCSP commits [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/528716 (https://phabricator.wikimedia.org/T220383) (owner: 10Vgutierrez) [12:11:15] (03PS1) 10Elukey: profile::kerberos::kdc: add daily backup for the KDC database [puppet] - 10https://gerrit.wikimedia.org/r/528775 (https://phabricator.wikimedia.org/T226089) [12:12:21] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [12:13:34] (03Abandoned) 10Elukey: role::aqs: update druid configuration with new MW snapshot [puppet] - 10https://gerrit.wikimedia.org/r/528445 (owner: 10Elukey) [12:14:50] (03PS2) 10Elukey: profile::kerberos::kdc: add daily backup for the KDC database [puppet] - 10https://gerrit.wikimedia.org/r/528775 (https://phabricator.wikimedia.org/T226089) [12:14:54] (03CR) 10Elukey: [C: 03+2] role::common::aqs: update druid mediawiki's datasource [puppet] - 10https://gerrit.wikimedia.org/r/528771 (owner: 10Fdans) [12:15:10] (03CR) 10Elukey: role::common::aqs: update druid mediawiki's datasource [puppet] - 10https://gerrit.wikimedia.org/r/528771 (owner: 10Fdans) [12:15:38] (03CR) 10Elukey: "Fran the parent seems to be 
https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/517085/3, is your puppet repo up to date?" [puppet] - 10https://gerrit.wikimedia.org/r/528771 (owner: 10Fdans) [12:16:55] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/17772/kerberos1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/528775 (https://phabricator.wikimedia.org/T226089) (owner: 10Elukey) [12:29:13] PROBLEM - puppet last run on cp1090 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/mtail/varnishxcps.mtail] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:32:29] (03CR) 10Volans: [C: 03+1] "LGTM modulo the question to activate it on both servers or only the primary. No need to wait for me to be back." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/521313 (owner: 10CRusnov) [12:35:16] 10Operations, 10SRE-Access-Requests: Requesting access to Mwmaint1002, Stat1007 for Abijeet Patro - https://phabricator.wikimedia.org/T230020 (10abi_) [12:37:31] PROBLEM - Host poolcounter2001 is DOWN: PING CRITICAL - Packet loss = 100% [12:39:25] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)130 ge (W)110 ge 107.2 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [12:44:38] (03PS2) 10Fdans: role::common::aqs: update druid mediawiki's datasource [puppet] - 10https://gerrit.wikimedia.org/r/528771 [12:45:27] (03PS3) 10Fdans: role::aqs: update druid mediawiki's datasource [puppet] - 10https://gerrit.wikimedia.org/r/528771 [12:47:56] (03CR) 10Elukey: [C: 03+2] role::aqs: update druid mediawiki's datasource [puppet] - 10https://gerrit.wikimedia.org/r/528771 (owner: 10Fdans) [12:57:11] RECOVERY - puppet last run on cp1090 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:01:59] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [13:11:29] (03PS1) 10Fomafix: Add 'bho' as alias for 'bh' [dns] - 10https://gerrit.wikimedia.org/r/528781 (https://phabricator.wikimedia.org/T41968) [13:11:41] (03PS1) 10Fomafix: Add 'bho' as alias for 'bh' [puppet] - 10https://gerrit.wikimedia.org/r/528782 (https://phabricator.wikimedia.org/T41968) [13:16:27] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589 (10ema) Horizon really is unbearably slow, to the point of being almost unusable. To add a data point, I've measured 16.21s simply... [13:18:14] (03CR) 10Marostegui: [C: 04-2] "Still need to manually apply the grants on the databases" [puppet] - 10https://gerrit.wikimedia.org/r/528734 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [13:19:30] (03CR) 10Marostegui: [C: 04-2] "PCC looks good https://puppet-compiler.wmflabs.org/compiler1001/17773/" [puppet] - 10https://gerrit.wikimedia.org/r/528734 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [13:19:41] 10Operations, 10SRE-tools: Create a cookbook to restart the jvms on a Cassandra cluster - https://phabricator.wikimedia.org/T230022 (10elukey) [13:22:03] !log roll restart aqs on aqs100[4-9] to pick up new Druid backend settings [13:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:28] 10Operations, 10SRE-tools: Create a cookbook to restart the jvms on a Cassandra cluster - https://phabricator.wikimedia.org/T230022 (10MoritzMuehlenhoff) @jbond added that a few days ago in 
https://gerrit.wikimedia.org/r/#/c/operations/cookbooks/+/528133/ :-) [13:25:19] 10Operations, 10SRE-tools: Create a cookbook to restart the jvms on a Cassandra cluster - https://phabricator.wikimedia.org/T230022 (10elukey) Really nice! AQS is not supported and I wasn't aware :P [13:25:24] 10Operations, 10SRE-tools: Create a cookbook to restart the jvms on a Cassandra cluster - https://phabricator.wikimedia.org/T230022 (10elukey) 05Open→03Resolved [13:25:26] 10Operations, 10SRE-tools, 10User-Joe, 10User-jijiki: Spicerack cookbooks TODO list - https://phabricator.wikimedia.org/T203943 (10elukey) [13:27:13] (03PS1) 10Elukey: sre.cassandra.roll-restart.py: support AQS [cookbooks] - 10https://gerrit.wikimedia.org/r/528786 [13:28:17] 10Operations, 10SRE-tools: Create a cookbook to restart the jvms on a Cassandra cluster - https://phabricator.wikimedia.org/T230022 (10MoritzMuehlenhoff) It supports single instance Cassandra clusters as well (for maps), so all it should take is to add "aqs" to the list of clusters [13:28:33] (03CR) 10Muehlenhoff: [C: 03+1] sre.cassandra.roll-restart.py: support AQS [cookbooks] - 10https://gerrit.wikimedia.org/r/528786 (owner: 10Elukey) [13:31:19] (03CR) 10Gehel: lvs: isolate cloudelastic icinga check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/528491 (https://phabricator.wikimedia.org/T229621) (owner: 10Mathew.onipe) [13:32:48] (03CR) 10Gehel: [C: 03+2] Cassandra nodetool repair cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/517377 (https://phabricator.wikimedia.org/T225694) (owner: 10Mathew.onipe) [13:32:54] (03CR) 10Elukey: [C: 03+2] sre.cassandra.roll-restart.py: support AQS [cookbooks] - 10https://gerrit.wikimedia.org/r/528786 (owner: 10Elukey) [13:32:58] (03PS2) 10Elukey: sre.cassandra.roll-restart.py: support AQS [cookbooks] - 10https://gerrit.wikimedia.org/r/528786 [13:40:00] (03CR) 10CDanis: [C: 03+1] prometheus: start collecting mediawiki aggregated stats [puppet] - 10https://gerrit.wikimedia.org/r/528733 
(https://phabricator.wikimedia.org/T228878) (owner: 10Filippo Giunchedi) [13:41:19] 10Operations: Update component/php72 to 7.2.20 - https://phabricator.wikimedia.org/T230024 (10MoritzMuehlenhoff) [13:41:31] 10Operations, 10serviceops: Update component/php72 to 7.2.20 - https://phabricator.wikimedia.org/T230024 (10MoritzMuehlenhoff) [13:41:45] (03PS3) 10Ema: Add Sukhbir Singh (sukhe) to the ops group [puppet] - 10https://gerrit.wikimedia.org/r/528585 (https://phabricator.wikimedia.org/T229860) (owner: 10Ssingh) [13:42:49] (03CR) 10Filippo Giunchedi: lvs: isolate cloudelastic icinga check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/528491 (https://phabricator.wikimedia.org/T229621) (owner: 10Mathew.onipe) [13:43:11] (03CR) 10Ema: [C: 03+2] Add Sukhbir Singh (sukhe) to the ops group [puppet] - 10https://gerrit.wikimedia.org/r/528585 (https://phabricator.wikimedia.org/T229860) (owner: 10Ssingh) [13:43:22] 10Operations, 10Diffusion, 10Packaging, 10Patch-For-Review, and 4 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (10MoritzMuehlenhoff) The update has been accepted by the Debian stable release managers and was uploaded:... 
[13:45:05] PROBLEM - Check whether ferm is active by checking the default input chain on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:45:05] PROBLEM - configured eth on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [13:45:23] lovely [13:45:47] PROBLEM - Disk space on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1004&var-datasource=eqiad+prometheus/ops [13:45:49] PROBLEM - MD RAID on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [13:45:53] PROBLEM - Check size of conntrack table on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [13:46:29] restarted the nrpe server [13:46:41] RECOVERY - Check whether ferm is active by checking the default input chain on stat1004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:46:41] RECOVERY - configured eth on stat1004 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [13:47:05] OOM killer acted [13:47:23] RECOVERY - Disk space on stat1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1004&var-datasource=eqiad+prometheus/ops [13:47:25] RECOVERY - MD RAID on stat1004 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering 
[13:47:29] RECOVERY - Check size of conntrack table on stat1004 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [13:47:59] !log Apply grants for dbproxy1003 on m3 - T202367 [13:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:08] T202367: Productionize dbproxy101[2-7].eqiad.wmnet and dbproxy200[1-4] - https://phabricator.wikimedia.org/T202367 [13:49:39] (03PS2) 10Subramanya Sastry: WIP: Add conditional loading of Parsoid/PHP as an extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528591 (https://phabricator.wikimedia.org/T229354) [13:50:21] (03CR) 10Subramanya Sastry: "> Patch Set 1: Code-Review-1" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528591 (https://phabricator.wikimedia.org/T229354) (owner: 10Subramanya Sastry) [13:52:07] (03CR) 10Marostegui: "Grants deployed on m3 databases for dbproxy2003" [puppet] - 10https://gerrit.wikimedia.org/r/528734 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [13:53:55] (03CR) 10jerkins-bot: [V: 04-1] WIP: Add conditional loading of Parsoid/PHP as an extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528591 (https://phabricator.wikimedia.org/T229354) (owner: 10Subramanya Sastry) [13:55:10] !log Remove labsdb1004 and labsdb1005 from zarcillo database, as those hosts were decommissioned months ago [13:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:34] (03CR) 10CDanis: [C: 03+1] "seems very reasonable to me" [puppet] - 10https://gerrit.wikimedia.org/r/528719 (https://phabricator.wikimedia.org/T229262) (owner: 10Filippo Giunchedi) [13:56:58] (03PS1) 10Marostegui: report_users: Add script to repo [software] - 10https://gerrit.wikimedia.org/r/528801 [13:57:11] !log Remove labsdb1004 and labsdb1005 from zarcillo database (instance table), as those hosts were decommissioned months ago [13:57:18] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:35] 10Operations, 10SRE-Access-Requests, 10Traffic, 10Patch-For-Review: SRE Onboarding for Sukhbir Singh - https://phabricator.wikimedia.org/T229860 (10ssingh) [14:01:46] !log disable puppet fleet wide for puppetdb restart [14:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:28] (03PS1) 10Filippo Giunchedi: prometheus: update snmp_exporter config [puppet] - 10https://gerrit.wikimedia.org/r/528805 (https://phabricator.wikimedia.org/T148541) [14:04:33] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:04:53] (03CR) 10Giuseppe Lavagetto: "A couple minor corrections to the bash script, LGTM otherwise." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/528527 (https://phabricator.wikimedia.org/T229631) (owner: 10CDanis) [14:05:07] (03PS3) 10Subramanya Sastry: Add conditional loading of Parsoid/PHP as an extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528591 (https://phabricator.wikimedia.org/T229354) [14:05:09] (03PS1) 10Subramanya Sastry: Set up a multiversion rest.php endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528806 (https://phabricator.wikimedia.org/T229356) [14:06:19] (03PS2) 10Mathew.onipe: lvs: isolate cloudelastic icinga check [puppet] - 10https://gerrit.wikimedia.org/r/528491 (https://phabricator.wikimedia.org/T229621) [14:06:48] (03CR) 10jerkins-bot: [V: 04-1] Add conditional loading of Parsoid/PHP as an extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528591 (https://phabricator.wikimedia.org/T229354) (owner: 10Subramanya Sastry) [14:07:52] (03CR) 10Subramanya Sastry: "If required, I can protect the requires with a wfHostName() === 'scandium' check for now." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/528806 (https://phabricator.wikimedia.org/T229356) (owner: 10Subramanya Sastry) [14:08:16] (03CR) 10Marostegui: [V: 03+2 C: 03+2] report_users: Add script to repo [software] - 10https://gerrit.wikimedia.org/r/528801 (owner: 10Marostegui) [14:08:33] (03CR) 10Subramanya Sastry: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528806 (https://phabricator.wikimedia.org/T229356) (owner: 10Subramanya Sastry) [14:08:35] (03PS2) 10Marostegui: maintain-views.yaml: Remove math table [puppet] - 10https://gerrit.wikimedia.org/r/528724 (https://phabricator.wikimedia.org/T196055) [14:08:42] (03PS1) 10Ema: ATS: add outbound_tls_settings for labs [puppet] - 10https://gerrit.wikimedia.org/r/528808 [14:08:48] (03PS4) 10Marostegui: mariadb: Productionize dbproxy2003 into m3-codfw [puppet] - 10https://gerrit.wikimedia.org/r/528734 (https://phabricator.wikimedia.org/T202367) [14:09:53] (03CR) 10Gergő Tisza: [C: 03+1] Set up a multiversion rest.php endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528806 (https://phabricator.wikimedia.org/T229356) (owner: 10Subramanya Sastry) [14:10:15] (03CR) 10Mathew.onipe: "PCC Output is Ok: https://puppet-compiler.wmflabs.org/compiler1002/17774/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/528491 (https://phabricator.wikimedia.org/T229621) (owner: 10Mathew.onipe) [14:11:47] (03PS8) 10CDanis: noc: fetch dbconfig from etcd to local disk [puppet] - 10https://gerrit.wikimedia.org/r/528527 (https://phabricator.wikimedia.org/T229631) [14:12:33] (03CR) 10CDanis: noc: fetch dbconfig from etcd to local disk (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/528527 (https://phabricator.wikimedia.org/T229631) (owner: 10CDanis) [14:16:13] !log puppet not re-enabled [14:16:20] !log puppet *now* re-enabled [14:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:26] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:01] (03CR) 10Ema: [C: 03+2] ATS: add outbound_tls_settings for labs [puppet] - 10https://gerrit.wikimedia.org/r/528808 (owner: 10Ema) [14:21:10] (03CR) 10Gehel: [C: 03+1] "This looks like a noop for the LVS servers in puppet compiler: https://puppet-compiler.wmflabs.org/compiler1002/17775/" [puppet] - 10https://gerrit.wikimedia.org/r/528491 (https://phabricator.wikimedia.org/T229621) (owner: 10Mathew.onipe) [14:21:12] (03CR) 10Gehel: [C: 03+2] lvs: isolate cloudelastic icinga check [puppet] - 10https://gerrit.wikimedia.org/r/528491 (https://phabricator.wikimedia.org/T229621) (owner: 10Mathew.onipe) [14:21:51] (03PS3) 10Gehel: lvs: isolate cloudelastic icinga check [puppet] - 10https://gerrit.wikimedia.org/r/528491 (https://phabricator.wikimedia.org/T229621) (owner: 10Mathew.onipe) [14:24:02] !log Reboot dbproxy2003 for kernel upgrades [14:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:57] (03PS2) 10Filippo Giunchedi: prometheus: start collecting mediawiki aggregated stats [puppet] - 10https://gerrit.wikimedia.org/r/528733 (https://phabricator.wikimedia.org/T228878) [14:26:22] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: start collecting mediawiki aggregated stats [puppet] - 10https://gerrit.wikimedia.org/r/528733 (https://phabricator.wikimedia.org/T228878) (owner: 10Filippo Giunchedi) [14:27:55] (03CR) 10Bstorm: "This will place it in docker registry config. Does it also or rather need to be in the Apache config? That's my only question. I love t" [puppet] - 10https://gerrit.wikimedia.org/r/528617 (owner: 10BryanDavis) [14:31:29] (03PS9) 10CDanis: noc: fetch dbconfig from etcd to local disk [puppet] - 10https://gerrit.wikimedia.org/r/528527 (https://phabricator.wikimedia.org/T229631) [14:31:42] (03CR) 10BryanDavis: "> This will place it in docker registry config. 
Does it also or" [puppet] - 10https://gerrit.wikimedia.org/r/528617 (owner: 10BryanDavis) [14:33:24] (03PS1) 10Elukey: admin: add the gpu-users group [puppet] - 10https://gerrit.wikimedia.org/r/528823 [14:34:42] (03CR) 10CDanis: [C: 03+2] noc: fetch dbconfig from etcd to local disk [puppet] - 10https://gerrit.wikimedia.org/r/528527 (https://phabricator.wikimedia.org/T229631) (owner: 10CDanis) [14:34:45] (03Abandoned) 10Elukey: admin: add the gpu-users group [puppet] - 10https://gerrit.wikimedia.org/r/528823 (owner: 10Elukey) [14:34:57] (03CR) 10Bstorm: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/528617 (owner: 10BryanDavis) [14:35:23] PROBLEM - Check correctness of the icinga configuration on icinga1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [14:36:59] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:37:14] (03PS1) 10Elukey: admin: add the gpu-users group [puppet] - 10https://gerrit.wikimedia.org/r/528826 [14:38:19] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: update snmp_exporter config [puppet] - 10https://gerrit.wikimedia.org/r/528805 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi) [14:38:26] (03PS2) 10Filippo Giunchedi: prometheus: update snmp_exporter config [puppet] - 10https://gerrit.wikimedia.org/r/528805 (https://phabricator.wikimedia.org/T148541) [14:38:36] icinga config is me, checking [14:39:29] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): elastic1031 - PSU status critical - https://phabricator.wikimedia.org/T229453 (10Cmjohnson) 05Open→03Resolved I will resolve this task for now....if it becomes critical please open again. 
[14:39:31] 10Operations, 10hardware-requests, 10Discovery-Search (Current work): Replace elastic1017-1031 - https://phabricator.wikimedia.org/T221636 (10Cmjohnson) [14:39:39] cd /Users/ltoscano/puppet [14:39:39] cd /Users/ltoscano/puppet [14:39:44] sure [14:39:45] :) [14:40:01] XDDD [14:40:26] elukey: osx -> https://preview.redd.it/1fze06y6rwr21.gif?width=480&format=mp4&s=696db46551322eb53bcc98927cb6666a9105f0ff [14:40:55] elukey: getting chatops setup on your laptop? :) [14:41:39] PROBLEM - puppet last run on mwmaint1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/log/fetch_dbconfig/fetch_dbconfig] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:41:44] (03PS1) 10Elukey: Add gpu-users to stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/528827 [14:41:58] ahahah I am not even sure how I did it, I don't have puppet under that dir, I was trying to open a link from irssi [14:42:03] didn't work as expected :D [14:43:08] (03PS1) 10CDanis: noc fetch dbconfig: fix [puppet] - 10https://gerrit.wikimedia.org/r/528829 [14:43:34] (03PS1) 10Gehel: Revert "lvs: isolate cloudelastic icinga check" [puppet] - 10https://gerrit.wikimedia.org/r/528830 [14:43:59] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T229156 (10Cmjohnson) Enclosure Device ID: 32 Slot Number: 2 Enclosure position: 1 Device Id: 2 WWN: 55cd2e415050a562 Sequence Number: 4 Media Error Count: 75 Other Error Count: 267 Predictive... 
[14:44:12] (03PS1) 10Fomafix: Add 'cmn' as alias for 'zh' [dns] - 10https://gerrit.wikimedia.org/r/528831 (https://phabricator.wikimedia.org/T23915) [14:44:23] (03PS2) 10Gehel: Revert "lvs: isolate cloudelastic icinga check" [puppet] - 10https://gerrit.wikimedia.org/r/528830 [14:44:40] (03PS1) 10Fomafix: Add 'cmn' as alias for 'zh' [puppet] - 10https://gerrit.wikimedia.org/r/528835 (https://phabricator.wikimedia.org/T23915) [14:46:15] PROBLEM - DPKG on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:46:15] PROBLEM - Check systemd state on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:46:23] PROBLEM - Check whether ferm is active by checking the default input chain on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:46:25] PROBLEM - configured eth on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [14:46:48] again the oom [14:47:53] RECOVERY - DPKG on stat1004 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:47:53] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:48:01] RECOVERY - Check whether ferm is active by checking the default input chain on stat1004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:48:01] RECOVERY - configured eth on stat1004 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [14:48:02] (03PS5) 10Bstorm: icinga: Set the WMCS host alerts to go only to WMCS [puppet] - 10https://gerrit.wikimedia.org/r/528581 
(https://phabricator.wikimedia.org/T229884) [14:48:50] (03PS6) 10CRusnov: netbox: Add configuration and timers for csv dumps [puppet] - 10https://gerrit.wikimedia.org/r/521313 [14:50:13] (03CR) 10CRusnov: netbox: redirect swagger doc requests to official docs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/528531 (owner: 10CRusnov) [14:50:20] (03CR) 10Bstorm: [C: 03+2] icinga: Set the WMCS host alerts to go only to WMCS [puppet] - 10https://gerrit.wikimedia.org/r/528581 (https://phabricator.wikimedia.org/T229884) (owner: 10Bstorm) [14:51:40] (03PS2) 10CDanis: noc fetch dbconfig: fix [puppet] - 10https://gerrit.wikimedia.org/r/528829 [14:53:36] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T229156 (10aborrero) [14:53:37] (03PS12) 10Mathew.onipe: Add maps reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) [14:54:01] (03CR) 10Mathew.onipe: Add maps reboot cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: 10Mathew.onipe) [14:55:12] (03CR) 10Gehel: [C: 03+2] Revert "lvs: isolate cloudelastic icinga check" [puppet] - 10https://gerrit.wikimedia.org/r/528830 (owner: 10Gehel) [14:55:21] (03PS3) 10Gehel: Revert "lvs: isolate cloudelastic icinga check" [puppet] - 10https://gerrit.wikimedia.org/r/528830 [14:56:07] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T229156 (10aborrero) We had a really high IO usage on this server the other day, along with very high load avg. {F29989813} https://grafana.wikimedia.org/d/aJgffPPmz/wmcs-openstack-e... 
[14:56:50] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T229156 (10Cmjohnson) A ticket has been opened with Dell You have successfully submitted request SR995773442. [14:57:08] * Urbanecm is downloading ~70 GB of data to mwmaint1002 for T223052 (server-side upload) [14:57:22] (03PS3) 10CDanis: noc fetch dbconfig: fix logging snafu [puppet] - 10https://gerrit.wikimedia.org/r/528829 (https://phabricator.wikimedia.org/T229631) [14:58:38] (03CR) 10CDanis: [C: 03+2] noc fetch dbconfig: fix logging snafu [puppet] - 10https://gerrit.wikimedia.org/r/528829 (https://phabricator.wikimedia.org/T229631) (owner: 10CDanis) [14:58:49] PROBLEM - puppet last run on mwmaint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/log/fetch_dbconfig/fetch_dbconfig] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:59:12] (03PS4) 10CDanis: noc fetch dbconfig: fix logging snafu [puppet] - 10https://gerrit.wikimedia.org/r/528829 (https://phabricator.wikimedia.org/T229631) [15:00:46] (03CR) 10jerkins-bot: [V: 04-1] Add maps reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: 10Mathew.onipe) [15:01:57] (03PS3) 10Cwhite: icinga: disable autocomplete.js in icinga search text input [puppet] - 10https://gerrit.wikimedia.org/r/528586 [15:02:35] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589 (10aborrero) >>! In T149589#5399356, @ema wrote: > Horizon really is unbearably slow, to the point of being almost unusable. > I... 
[15:02:46] (03CR) 10Cwhite: icinga: disable autocomplete.js in icinga search text input (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/528586 (owner: 10Cwhite) [15:03:35] RECOVERY - puppet last run on mwmaint1002 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:06:43] RECOVERY - Check correctness of the icinga configuration on icinga1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [15:07:28] (03PS1) 10CDanis: noc fetch_dbconfig: fix script mode [puppet] - 10https://gerrit.wikimedia.org/r/528850 [15:08:40] <_joe_> !log freeing APCu on mw1270, which has degraded performance [15:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:19] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589 (10JHedden) Viewing the instance console log can occasionally take longer than expected. This process queries multiple APIs and com... 
[15:10:24] (03CR) 10CDanis: [C: 03+2] noc fetch_dbconfig: fix script mode [puppet] - 10https://gerrit.wikimedia.org/r/528850 (owner: 10CDanis) [15:10:55] (03PS1) 10Ssingh: Add Sukhbir Singh (sukhe) to icinga groups [puppet] - 10https://gerrit.wikimedia.org/r/528851 (https://phabricator.wikimedia.org/T229860) [15:10:59] (03PS5) 10Volans: cookbook API: add class API [software/spicerack] - 10https://gerrit.wikimedia.org/r/506956 (https://phabricator.wikimedia.org/T221212) [15:11:24] it's WIP, tests will fail, just to show to others ^^^ [15:14:21] elukey: T230022 makes me think you haven't known about c-foreach-restart [15:14:23] T230022: Create a cookbook to restart the jvms on a Cassandra cluster - https://phabricator.wikimedia.org/T230022 [15:14:50] (03PS2) 10Gehel: Define cloudelastic as a cluster in hieradata [puppet] - 10https://gerrit.wikimedia.org/r/528554 (https://phabricator.wikimedia.org/T229937) (owner: 10EBernhardson) [15:14:54] (03CR) 10Muehlenhoff: [C: 04-1] admin: add the gpu-users group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/528826 (owner: 10Elukey) [15:15:16] (03PS1) 10Filippo Giunchedi: prometheus: bump timeout for pdu jobs [puppet] - 10https://gerrit.wikimedia.org/r/528856 (https://phabricator.wikimedia.org/T148541) [15:15:18] (03PS1) 10Filippo Giunchedi: prometheus: add sentry4 outlet OIDs [puppet] - 10https://gerrit.wikimedia.org/r/528857 (https://phabricator.wikimedia.org/T148541) [15:15:24] (03PS13) 10Mathew.onipe: Add maps reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) [15:15:52] (03CR) 10jerkins-bot: [V: 04-1] cookbook API: add class API [software/spicerack] - 10https://gerrit.wikimedia.org/r/506956 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans) [15:15:55] urandom: Moritz already told me, I added the aqs support :) [15:16:10] (03CR) 10Gehel: [C: 03+2] Define cloudelastic as a cluster in hieradata [puppet] - 
10https://gerrit.wikimedia.org/r/528554 (https://phabricator.wikimedia.org/T229937) (owner: 10EBernhardson) [15:16:41] elukey: I thought you knew about the stuff in cassandra-tools-wmf [15:17:06] 10Operations, 10MediaWiki-Configuration, 10conftool, 10Performance-Team (Radar): noc.wm.o/db.php: remove hosts information, or fetch it from etcd somehow - https://phabricator.wikimedia.org/T229631 (10CDanis) @Marostegui as of now, there is https://noc.wikimedia.org/dbconfig/eqiad.json and https://noc.wiki... [15:17:12] elukey: (there are other handy things in there too) [15:17:41] 10Operations, 10cloud-services-team, 10serviceops, 10Core Platform Team Legacy (Watching / External), and 4 others: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (10jijiki) [15:18:07] (03CR) 10Elukey: admin: add the gpu-users group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/528826 (owner: 10Elukey) [15:18:37] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: bump timeout for pdu jobs [puppet] - 10https://gerrit.wikimedia.org/r/528856 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi) [15:18:44] urandom: yeah I recall that we discussed it, but never tried sorry :( [15:18:45] (03PS2) 10Filippo Giunchedi: prometheus: bump timeout for pdu jobs [puppet] - 10https://gerrit.wikimedia.org/r/528856 (https://phabricator.wikimedia.org/T148541) [15:21:44] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] prometheus: bump timeout for pdu jobs [puppet] - 10https://gerrit.wikimedia.org/r/528856 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi) [15:23:51] (03PS1) 10Effie Mouzeli: mediawiki:maintenance: switch wikidata to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/528875 (https://phabricator.wikimedia.org/T195392) [15:24:30] RECOVERY - puppet last run on mwmaint2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:25:38] (03PS2) 10Elukey: admin: add the gpu-users group [puppet] - 10https://gerrit.wikimedia.org/r/528826 [15:25:40] (03PS2) 10Elukey: Add gpu-users to stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/528827 [15:30:28] (03PS4) 10Cwhite: icinga: disable autocomplete.js in icinga search text input [puppet] - 10https://gerrit.wikimedia.org/r/528586 [15:35:13] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589 (10ema) >>! In T149589#5399667, @aborrero wrote: >>>! In T149589#5399356, @ema wrote: >> Horizon really is unbearably slow, to the... [15:36:02] (03PS2) 10Effie Mouzeli: mediawiki:maintenance: switch wikidata to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/528875 (https://phabricator.wikimedia.org/T195392) [15:36:29] (03CR) 10Ottomata: [C: 03+1] admin: add the gpu-users group [puppet] - 10https://gerrit.wikimedia.org/r/528826 (owner: 10Elukey) [15:40:52] (03CR) 10Ladsgroup: mediawiki:maintenance: switch wikidata to PHP 7.2 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/528875 (https://phabricator.wikimedia.org/T195392) (owner: 10Effie Mouzeli) [15:42:15] (03PS3) 10Effie Mouzeli: mediawiki:maintenance: switch wikidata to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/528875 (https://phabricator.wikimedia.org/T195392) [15:46:37] (03CR) 10Effie Mouzeli: mediawiki:maintenance: switch wikidata to PHP 7.2 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/528875 (https://phabricator.wikimedia.org/T195392) (owner: 10Effie Mouzeli) [15:46:48] (03CR) 10Effie Mouzeli: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/17785/mwmaint1002.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/528875 (https://phabricator.wikimedia.org/T195392) (owner: 10Effie Mouzeli) [15:47:36] (03CR) 10Ladsgroup: [C: 03+1] mediawiki:maintenance: 
switch wikidata to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/528875 (https://phabricator.wikimedia.org/T195392) (owner: 10Effie Mouzeli) [15:48:47] (03PS2) 10Ssingh: Add Sukhbir Singh (sukhe) to icinga groups [puppet] - 10https://gerrit.wikimedia.org/r/528851 (https://phabricator.wikimedia.org/T229860) [15:48:49] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mediawiki:maintenance: switch wikidata to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/528875 (https://phabricator.wikimedia.org/T195392) (owner: 10Effie Mouzeli) [15:50:19] (03CR) 10Effie Mouzeli: [V: 03+1 C: 03+2] mediawiki:maintenance: switch wikidata to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/528875 (https://phabricator.wikimedia.org/T195392) (owner: 10Effie Mouzeli) [15:50:29] (03PS4) 10Effie Mouzeli: mediawiki:maintenance: switch wikidata to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/528875 (https://phabricator.wikimedia.org/T195392) [15:53:48] (03CR) 10Ema: [C: 03+1] Add Sukhbir Singh (sukhe) to icinga groups [puppet] - 10https://gerrit.wikimedia.org/r/528851 (https://phabricator.wikimedia.org/T229860) (owner: 10Ssingh) [15:57:28] PROBLEM - dhclient process on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [15:57:44] PROBLEM - configured eth on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [15:57:48] PROBLEM - Disk space on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1004&var-datasource=eqiad+prometheus/ops [15:58:06] PROBLEM - Check whether ferm is active by checking the default input chain on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused 
https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:58:32] PROBLEM - Check the NTP synchronisation status of timesyncd on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [15:58:32] !log restart nrpe on stat1004 [15:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:12] thanks :) [15:59:49] 10Operations, 10SRE-Access-Requests: Requesting access to Mwmaint1002, Stat1007 for Abijeet Patro - https://phabricator.wikimedia.org/T230020 (10Reedy) Will need to get your manager to sign this off [16:00:00] RECOVERY - dhclient process on stat1004 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [16:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Morning SWAT (Max 6 patches) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190807T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS.
[16:00:18] RECOVERY - configured eth on stat1004 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [16:00:24] RECOVERY - Disk space on stat1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1004&var-datasource=eqiad+prometheus/ops [16:00:43] !log restarting jenkins for update [16:00:44] RECOVERY - Check whether ferm is active by checking the default input chain on stat1004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:54] (03PS1) 10Mathew.onipe: lvs: isolate cloudelastic icinga check [puppet] - 10https://gerrit.wikimedia.org/r/528885 (https://phabricator.wikimedia.org/T229621) [16:10:46] PROBLEM - Device not healthy -SMART- on cloudelastic1002 is CRITICAL: cluster=cloudelastic device=sdb instance=cloudelastic1002:9100 job=node site=eqiad https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cloudelastic1002&var-datasource=eqiad+prometheus/ops [16:13:44] (03CR) 10Ssingh: [C: 03+2] Add Sukhbir Singh (sukhe) to icinga groups [puppet] - 10https://gerrit.wikimedia.org/r/528851 (https://phabricator.wikimedia.org/T229860) (owner: 10Ssingh) [16:14:32] (03PS3) 10Ssingh: Add Sukhbir Singh (sukhe) to icinga groups [puppet] - 10https://gerrit.wikimedia.org/r/528851 (https://phabricator.wikimedia.org/T229860) [16:17:34] (03CR) 10Mathew.onipe: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/528885 (https://phabricator.wikimedia.org/T229621) (owner: 10Mathew.onipe) [16:18:10] (03CR) 10Mathew.onipe: "PCC output is Ok: https://puppet-compiler.wmflabs.org/compiler1001/17786/" [puppet] - 10https://gerrit.wikimedia.org/r/528885 (https://phabricator.wikimedia.org/T229621) (owner: 10Mathew.onipe) [16:20:58] (03PS1) 10Ppchelko: Switch 
high-traffic jobs to eventgate. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528886 (https://phabricator.wikimedia.org/T228705) [16:22:13] (03PS4) 10Giuseppe Lavagetto: envoyproxy: create module, add tls terminator definition [puppet] - 10https://gerrit.wikimedia.org/r/526110 [16:22:15] (03PS1) 10Giuseppe Lavagetto: profile::tlsproxy::envoy: new TLS terminator for services [puppet] - 10https://gerrit.wikimedia.org/r/528887 [16:29:06] RECOVERY - Check the NTP synchronisation status of timesyncd on stat1004 is OK: OK: synced at Wed 2019-08-07 16:29:05 UTC. https://wikitech.wikimedia.org/wiki/NTP [16:31:24] (03CR) 10Mobrovac: [C: 03+2] Switch high-traffic jobs to eventgate. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528886 (https://phabricator.wikimedia.org/T228705) (owner: 10Ppchelko) [16:31:33] (03PS1) 10Effie Mouzeli: mediawiki:maintenance: switch wikidata to PHP 7.2 (prod) [puppet] - 10https://gerrit.wikimedia.org/r/528889 (https://phabricator.wikimedia.org/T195392) [16:31:43] * mobrovac taking the deployment server for a cfg patch [16:32:29] (03Merged) 10jenkins-bot: Switch high-traffic jobs to eventgate. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528886 (https://phabricator.wikimedia.org/T228705) (owner: 10Ppchelko) [16:32:45] (03CR) 10jenkins-bot: Switch high-traffic jobs to eventgate. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528886 (https://phabricator.wikimedia.org/T228705) (owner: 10Ppchelko) [16:34:58] !log mobrovac@deploy1001 scap failed: average error rate on 6/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details) [16:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:18] Pchelolo: ^ [16:35:22] error rate too high [16:35:38] Pchelolo: reverting [16:35:44] mobrovac: hm... [16:35:53] Pchelolo: or i can retry [16:36:00] mobrovac: nono [16:36:06] no to what? 
[16:36:41] (03CR) 10Ladsgroup: [C: 03+1] mediawiki:maintenance: switch wikidata to PHP 7.2 (prod) [puppet] - 10https://gerrit.wikimedia.org/r/528889 (https://phabricator.wikimedia.org/T195392) (owner: 10Effie Mouzeli) [16:36:43] nono, revert and don't try again [16:36:51] k [16:37:17] (03PS1) 10Mobrovac: Revert "Switch high-traffic jobs to eventgate." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528890 (https://phabricator.wikimedia.org/T228705) [16:38:50] (03CR) 10Mobrovac: [V: 03+2 C: 03+2] Revert "Switch high-traffic jobs to eventgate." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528890 (https://phabricator.wikimedia.org/T228705) (owner: 10Mobrovac) [16:39:09] (03CR) 10jenkins-bot: Revert "Switch high-traffic jobs to eventgate." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528890 (https://phabricator.wikimedia.org/T228705) (owner: 10Mobrovac) [16:39:14] eww [16:39:15] hnm, [16:40:27] errors are back to normal now [16:40:37] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589 (10aborrero) Yes, the puppet information in horizon is extremely slow, specially the Prefix Puppet pages. That in concrete is a kno... [16:40:37] !log mobrovac@deploy1001 Synchronized wmf-config/InitialiseSettings.php: JobQueue: Revert switching high-traffic jobs to eventgate (duration: 00m 55s) [16:40:41] (after deploying the revert) [16:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:36] "message": "'.delay_until' should match format \"date-time\"", [16:41:52] PROBLEM - Kafka topic throughput alert for eventgate-main_validation_errors in cluster jumbo-eqiad for topic-s- .-\.eventgate-main\.error\.validation. Message rate should be gt -0.0- 0.5-. 
on icinga1001 is CRITICAL: 0.7603 gt 0.5 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?refresh=1m&orgId=1&var-dc=eqiad+prometheus/k8s&var-service=eventgate-main&var-kafka_t [16:41:52] a_broker=All&var-kafka_producer_type=All [16:41:55] "delay_until\":\"1565196089\" [16:42:52] cool the alarm works! :) [16:42:59] 10Operations, 10SRE-Access-Requests, 10Traffic, 10Patch-For-Review: SRE Onboarding for Sukhbir Singh - https://phabricator.wikimedia.org/T229860 (10MoritzMuehlenhoff) [16:45:03] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki:maintenance: switch wikidata to PHP 7.2 (prod) [puppet] - 10https://gerrit.wikimedia.org/r/528889 (https://phabricator.wikimedia.org/T195392) (owner: 10Effie Mouzeli) [16:53:47] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589 (10bd808) I have dreams of a complete rewrite of our Puppet dashboard (what plugins are called in Horizon), but that is stuck behin... 
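Note on the 16:41 validation error: the rejected value "1565196089" is a Unix-epoch string, while the schema asks for the "date-time" format, which in JSON Schema means an RFC 3339 timestamp. The sketch below shows the conversion only; it assumes the value is in whole seconds and UTC, the function name is hypothetical, and it is not the fix that was actually applied to the job code.

```python
from datetime import datetime, timezone


def epoch_to_rfc3339(value: str) -> str:
    """Convert a Unix-epoch string (whole seconds) to an RFC 3339 UTC timestamp.

    Hypothetical helper for illustration; assumes the input is seconds since
    1970-01-01T00:00:00Z, as the value in the log appears to be.
    """
    moment = datetime.fromtimestamp(int(value), tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


if __name__ == "__main__":
    # The value quoted in the channel at 16:41:55.
    print(epoch_to_rfc3339("1565196089"))
```

For the value from the log this yields 2019-08-07T16:41:29Z, which lines up with the time the failed deploy was running.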
[16:55:00] (03Abandoned) 10Ottomata: Release 2.4.3 for Debian Buster [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/526527 (https://phabricator.wikimedia.org/T222253) (owner: 10Ottomata) [16:55:37] (03PS1) 10Jhedden: toolschecker: match nginx and wsgi timeouts [puppet] - 10https://gerrit.wikimedia.org/r/528892 (https://phabricator.wikimedia.org/T221301) [16:58:13] (03PS1) 10Ottomata: 2.3.1 release for buster and python 3.7 [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/528894 (https://phabricator.wikimedia.org/T229347) [16:59:09] 10Operations, 10cloud-services-team, 10serviceops, 10Core Platform Team Legacy (Watching / External), and 4 others: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (10jijiki) [17:12:12] 10Operations, 10hardware-requests: eqiad+codfw: 6x hardware request for swift backend (each site) - https://phabricator.wikimedia.org/T227314 (10RobH) [17:15:36] (03PS4) 10Ssingh: Add Sukhbir Singh (sukhe) to icinga groups [puppet] - 10https://gerrit.wikimedia.org/r/528851 (https://phabricator.wikimedia.org/T229860) [17:16:31] (03PS1) 10Jhedden: toolschecker: check status for webservice tasks [puppet] - 10https://gerrit.wikimedia.org/r/528897 (https://phabricator.wikimedia.org/T221301) [17:17:07] (03CR) 10Elukey: [C: 03+1] 2.3.1 release for buster and python 3.7 [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/528894 (https://phabricator.wikimedia.org/T229347) (owner: 10Ottomata) [17:17:27] (03PS1) 10Bstorm: monitoring: change WMCS services to paging WMCS only [puppet] - 10https://gerrit.wikimedia.org/r/528898 (https://phabricator.wikimedia.org/T229884) [17:18:06] (03CR) 10jerkins-bot: [V: 04-1] toolschecker: check status for webservice tasks [puppet] - 10https://gerrit.wikimedia.org/r/528897 (https://phabricator.wikimedia.org/T221301) (owner: 10Jhedden) [17:18:53] 10Operations, 10SRE-Access-Requests, 10Traffic, 10Patch-For-Review: SRE Onboarding for Sukhbir Singh - 
https://phabricator.wikimedia.org/T229860 (10ssingh) [17:20:57] (03CR) 10Jhedden: "puppet compiler results: https://puppet-compiler.wmflabs.org/compiler1002/17787/tools-checker-03.tools.eqiad.wmflabs/" [puppet] - 10https://gerrit.wikimedia.org/r/528892 (https://phabricator.wikimedia.org/T221301) (owner: 10Jhedden) [17:21:50] what mediawiki related cronjobs do we have that run every 2 hours? [17:25:42] (03PS2) 10Ottomata: 2.3.1-4 release with python 3.5 and python 3.7 compatibility [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/528894 (https://phabricator.wikimedia.org/T229347) [17:26:08] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [17:28:14] (03CR) 10Jhedden: "it's probably better to place the pykube exception directly in webservice... 
I'll look into that" [puppet] - 10https://gerrit.wikimedia.org/r/528897 (https://phabricator.wikimedia.org/T221301) (owner: 10Jhedden) [17:33:37] (03PS5) 10Giuseppe Lavagetto: envoyproxy: create module, add tls terminator definition [puppet] - 10https://gerrit.wikimedia.org/r/526110 [17:33:39] (03PS2) 10Giuseppe Lavagetto: profile::tlsproxy::envoy: new TLS terminator for services [puppet] - 10https://gerrit.wikimedia.org/r/528887 [17:33:41] (03PS1) 10Giuseppe Lavagetto: [WiP] role::webserver_misc_static: add TLS termination with envoy [puppet] - 10https://gerrit.wikimedia.org/r/528900 [17:37:43] 10Operations: install2002 short on disk space - https://phabricator.wikimedia.org/T229997 (10colewhite) [17:37:45] 10Operations: install2002 94% disk usage on "/" - https://phabricator.wikimedia.org/T211850 (10colewhite) [17:37:55] 10Operations: install2002 short on disk space - https://phabricator.wikimedia.org/T229997 (10colewhite) a:03colewhite [17:39:53] (03PS2) 10Jhedden: toolschecker: check status for webservice tasks [puppet] - 10https://gerrit.wikimedia.org/r/528897 (https://phabricator.wikimedia.org/T221301) [17:42:28] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Retry - Revert "Switch high-traffic jobs to eventgate." 
(duration: 00m 58s) [17:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:21] !log install2002 stop nginx and squid for resync /srv to spare disk and restore mount - T229997 [17:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:30] T229997: install2002 short on disk space - https://phabricator.wikimedia.org/T229997 [17:48:08] (03PS3) 10Jhedden: toolschecker: check status for webservice tasks [puppet] - 10https://gerrit.wikimedia.org/r/528897 (https://phabricator.wikimedia.org/T221301) [17:54:21] !log install2002 add fstab entry for /srv mount - T229997 [17:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:30] T229997: install2002 short on disk space - https://phabricator.wikimedia.org/T229997 [18:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190807T1800) [18:00:06] 10Operations: install2002 short on disk space - https://phabricator.wikimedia.org/T229997 (10colewhite) It looks like this has happened before, so I linked the case. A reboot occurred and the disk that was added in December was not remounted. The installserver synchronization process filled /srv with replica d... [18:00:08] 10Operations: install2002 short on disk space - https://phabricator.wikimedia.org/T229997 (10colewhite) 05Open→03Resolved [18:00:10] 10Operations: install2002 94% disk usage on "/" - https://phabricator.wikimedia.org/T211850 (10colewhite) [18:00:25] shdubsh: wow, where did you get that spare disk from ? [18:00:35] was it in there already? 
[18:01:05] from past you, mutante december 2018 edition [18:01:17] gj mutante ;p [18:02:15] (03CR) 10Ottomata: [V: 03+2 C: 03+2] 2.3.1-4 release with python 3.5 and python 3.7 compatibility [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/528894 (https://phabricator.wikimedia.org/T229347) (owner: 10Ottomata) [18:03:26] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003 and notebook1004] and groups for cchen - https://phabricator.wikimedia.org/T228447 (10kzimmerman) [18:04:53] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003 and notebook1004] and groups for cchen - https://phabricator.wikimedia.org/T228447 (10kzimmerman) Updated ticket to reflect 3 business days passing and approval from Nuria. @RobH... [18:05:54] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003 and notebook1004] and groups for cchen - https://phabricator.wikimedia.org/T228447 (10kzimmerman) Sorry, I see that I should not have checked off those items - unchecking them! [18:06:11] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003 and notebook1004] and groups for cchen - https://phabricator.wikimedia.org/T228447 (10kzimmerman) [18:06:50] RECOVERY - Kafka topic throughput alert for eventgate-main_validation_errors in cluster jumbo-eqiad for topic-s- .-\.eventgate-main\.error\.validation. Message rate should be gt -0.0- 0.5-. 
on icinga1001 is OK: (C)0.5 gt (W)0 gt 0 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?refresh=1m&orgId=1&var-dc=eqiad+prometheus/k8s&var-service=eventgate-main&var-kafka_ [18:06:50] ka_broker=All&var-kafka_producer_type=All [18:06:58] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003 and notebook1004] and groups for cchen - https://phabricator.wikimedia.org/T228447 (10RobH) >>! In T228447#5400367, @kzimmerman wrote: > Updated ticket to reflect 3 business days... [18:09:54] (03CR) 10Bstorm: "https://puppet-compiler.wmflabs.org/compiler1001/17788/" [puppet] - 10https://gerrit.wikimedia.org/r/528898 (https://phabricator.wikimedia.org/T229884) (owner: 10Bstorm) [18:11:51] (03PS3) 10Dzahn: gerrit: replication: escape project slashes [puppet] - 10https://gerrit.wikimedia.org/r/528769 (https://phabricator.wikimedia.org/T229945) (owner: 10Thcipriani) [18:15:15] !log Restart hhvm and php-fpm on canary mw hosts [18:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:29] (03CR) 10Dzahn: [C: 03+2] gerrit: replication: escape project slashes [puppet] - 10https://gerrit.wikimedia.org/r/528769 (https://phabricator.wikimedia.org/T229945) (owner: 10Thcipriani) [18:17:02] PROBLEM - puppet last run on analytics1043 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[python3.7] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [18:20:34] 10Operations, 10Domains, 10Product-Design-Strategy, 10Traffic: Add a repo reference to Design Strategy web address - https://phabricator.wikimedia.org/T230053 (10Volker_E) [18:20:56] 10Operations, 10Readers-Web-Backlog, 10Traffic: [Bug] iPadOS 13 shows the desktop version of Safari with a broken layout - https://phabricator.wikimedia.org/T229875 (10Jdlrobson) a:05Jdlrobson→03ovasileva Not sure if you or Sam would be best placed to work out what to do with this. Talking to Operations... [18:21:23] 10Operations, 10Domains, 10Product-Design-Strategy, 10Traffic: Add a repo reference to Design Strategy web address - https://phabricator.wikimedia.org/T230053 (10Volker_E) [18:23:44] (03CR) 10Reedy: [C: 03+1] "I don't think it needs the hostname protection as it'll give a 403 to casual browsers etc based on wgEnableRestAPI" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528806 (https://phabricator.wikimedia.org/T229356) (owner: 10Subramanya Sastry) [18:25:15] 10Operations, 10Domains, 10Product-Design-Strategy, 10Traffic: Add a repo reference to Design Strategy web address - https://phabricator.wikimedia.org/T230053 (10Dzahn) This is `modules/profile/manifests/microsites/design.pp` in the puppet repo. Happy to help adding a third repo there to git clone from.... 
[18:25:16] (03PS1) 10Bstorm: monitoring: Switch a collection of WMCS alerts to email-only [puppet] - 10https://gerrit.wikimedia.org/r/528905 (https://phabricator.wikimedia.org/T229884) [18:27:46] (03PS1) 10Jbond: puppetmaster upgrade: add a lua filter to remove the job_id [puppet] - 10https://gerrit.wikimedia.org/r/528906 (https://phabricator.wikimedia.org/T230002) [18:29:55] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster upgrade: add a lua filter to remove the job_id [puppet] - 10https://gerrit.wikimedia.org/r/528906 (https://phabricator.wikimedia.org/T230002) (owner: 10Jbond) [18:33:58] (03CR) 10Jhedden: [C: 03+1] monitoring: Switch a collection of WMCS alerts to email-only [puppet] - 10https://gerrit.wikimedia.org/r/528905 (https://phabricator.wikimedia.org/T229884) (owner: 10Bstorm) [18:38:33] (03PS2) 10Jbond: puppetmaster upgrade: add a lua filter to remove the job_id [puppet] - 10https://gerrit.wikimedia.org/r/528906 (https://phabricator.wikimedia.org/T230002) [18:41:33] (03PS3) 10Jbond: puppetmaster upgrade: add a lua filter to remove the job_id [puppet] - 10https://gerrit.wikimedia.org/r/528906 (https://phabricator.wikimedia.org/T230002) [18:43:53] 10Operations, 10Domains, 10Product-Design-Strategy, 10Traffic: Add a repo reference to Design Strategy web address - https://phabricator.wikimedia.org/T230053 (10Volker_E) > So the new one is called "strategy" and you wand /strategy as the URL as well? That's correct. 
:) [18:44:54] RECOVERY - puppet last run on analytics1043 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [18:46:43] (03PS6) 10Bstorm: Update wmflabs.org redirect target [puppet] - 10https://gerrit.wikimedia.org/r/528304 (https://phabricator.wikimedia.org/T229896) (owner: 10DannyS712) [18:49:10] (03PS4) 10Jbond: puppetmaster upgrade: add a lua filter to remove the job_id [puppet] - 10https://gerrit.wikimedia.org/r/528906 (https://phabricator.wikimedia.org/T230002) [18:56:03] (03CR) 10Bstorm: [C: 03+2] Update wmflabs.org redirect target [puppet] - 10https://gerrit.wikimedia.org/r/528304 (https://phabricator.wikimedia.org/T229896) (owner: 10DannyS712) [18:56:10] (03CR) 10Urbanecm: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/528304 (https://phabricator.wikimedia.org/T229896) (owner: 10DannyS712) [18:59:13] (03PS1) 10Reedy: Revert "Switch property terms migration to WRITE_NEW on client wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528909 [18:59:16] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003 and notebook1004] and groups for cchen - https://phabricator.wikimedia.org/T228447 (10kzimmerman) Thanks Rob! [18:59:27] (03PS2) 10Reedy: Revert "Switch property terms migration to WRITE_NEW on client wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528909 [19:00:04] brennen: (Dis)respected human, time to deploy MediaWiki train - American version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190807T1900). Please do the needful. [19:01:02] (03CR) 10Reedy: [C: 03+2] Revert "Switch property terms migration to WRITE_NEW on client wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528909 (owner: 10Reedy) [19:02:12] awaiting deploy of above revert. 
[19:02:33] gate and submit is buseh
[19:03:10] (PS3) Reedy: Revert "Switch property terms migration to WRITE_NEW on client wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/528909 (https://phabricator.wikimedia.org/T225053)
[19:03:24] (CR) Reedy: [C: +2] Revert "Switch property terms migration to WRITE_NEW on client wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/528909 (https://phabricator.wikimedia.org/T225053) (owner: Reedy)
[19:14:19] (Merged) jenkins-bot: Revert "Switch property terms migration to WRITE_NEW on client wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/528909 (https://phabricator.wikimedia.org/T225053) (owner: Reedy)
[19:16:03] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Revert Switch property terms migration to WRITE_NEW on client wikis T225053 (duration: 00m 58s)
[19:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:16:13] T225053: Switch `tmpPropertyTermsMigrationStage` to MIGRATION_WRITE_NEW - https://phabricator.wikimedia.org/T225053
[19:16:36] (CR) jenkins-bot: Revert "Switch property terms migration to WRITE_NEW on client wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/528909 (https://phabricator.wikimedia.org/T225053) (owner: Reedy)
[19:19:50] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[19:24:08] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 40.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[19:24:34] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL:
CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[19:25:46] going ahead with train → group1
[19:27:59] (PS1) Brennen Bearnes: group1 wikis to 1.34.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/528912
[19:28:01] (CR) Brennen Bearnes: [C: +2] group1 wikis to 1.34.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/528912 (owner: Brennen Bearnes)
[19:29:59] (Merged) jenkins-bot: group1 wikis to 1.34.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/528912 (owner: Brennen Bearnes)
[19:30:15] (CR) jenkins-bot: group1 wikis to 1.34.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/528912 (owner: Brennen Bearnes)
[19:35:46] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.34.0-wmf.17
[19:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:36:41] !log brennen@deploy1001 Synchronized php: group1 wikis to 1.34.0-wmf.17 (duration: 00m 54s)
[19:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:48] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[19:43:44] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[19:44:52] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates
https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[19:54:50] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[19:58:47] (PS1) Legoktm: extdist: Switch to gerrit-replica for cloning [puppet] - https://gerrit.wikimedia.org/r/528919
[20:00:38] (CR) Legoktm: "This should be safe to merge any time." [puppet] - https://gerrit.wikimedia.org/r/528919 (owner: Legoktm)
[20:02:57] (PS1) Dzahn: design.wikimedia.org: add new dir and repo for strategy site [puppet] - https://gerrit.wikimedia.org/r/528922 (https://phabricator.wikimedia.org/T230053)
[20:25:22] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[20:26:32] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[20:29:36] (PS2) Bstorm: toolforge: modernize updatetools script [puppet] - https://gerrit.wikimedia.org/r/526309 (https://phabricator.wikimedia.org/T164971) (owner: BryanDavis)
[20:32:05] (PS4) Dzahn: gerrit: replication: escape project slashes [puppet] - https://gerrit.wikimedia.org/r/528769 (https://phabricator.wikimedia.org/T229945) (owner: Thcipriani)
[20:32:34] !log reedy@deploy1001 rebuilt and synchronized wikiversions files: labswiki back to .16 temporarily
[20:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:24] !log reedy@deploy1001 rebuilt and synchronized wikiversions files: labswiki back to .17
[20:34:31] Logged
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:39:13] (PS3) Bstorm: toolforge: modernize updatetools script [puppet] - https://gerrit.wikimedia.org/r/526309 (https://phabricator.wikimedia.org/T164971) (owner: BryanDavis)
[20:41:07] (CR) Bstorm: [C: +2] toolforge: modernize updatetools script [puppet] - https://gerrit.wikimedia.org/r/526309 (https://phabricator.wikimedia.org/T164971) (owner: BryanDavis)
[20:44:12] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[20:44:38] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[20:48:05] jouncebot: next
[20:48:05] In 2 hour(s) and 11 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190807T2300)
[20:48:24] (CR) Jhedden: [C: +1] "LGTM comment on prometheus query, but not a blocker at all."
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/528898 (https://phabricator.wikimedia.org/T229884) (owner: Bstorm)
[20:50:18] Operations, Core Platform Team, MediaWiki-API, Wikidata, Wikidata-Campsite: wikidata.org handles GET MWAPI requests, but silently fails on POST - https://phabricator.wikimedia.org/T230051 (Yurik)
[20:57:12] Operations, Core Platform Team, MediaWiki-API, Wikidata, Wikidata-Campsite: wikidata.org handles GET MWAPI requests, but silently fails on POST - https://phabricator.wikimedia.org/T230051 (Yurik)
[21:00:18] !log apply transient logger settings from prod search clusters to cloudelastic
[21:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:13:34] (CR) Bstorm: monitoring: change WMCS services to paging WMCS only (1 comment) [puppet] - https://gerrit.wikimedia.org/r/528898 (https://phabricator.wikimedia.org/T229884) (owner: Bstorm)
[21:14:02] (PS2) Bstorm: monitoring: change WMCS services to paging WMCS only [puppet] - https://gerrit.wikimedia.org/r/528898 (https://phabricator.wikimedia.org/T229884)
[21:15:36] (PS1) CDanis: noc: read dbctl JSON from local disk mirror of etcd [mediawiki-config] - https://gerrit.wikimedia.org/r/528938 (https://phabricator.wikimedia.org/T229631)
[21:21:39] (CR) Bstorm: [C: +2] monitoring: change WMCS services to paging WMCS only [puppet] - https://gerrit.wikimedia.org/r/528898 (https://phabricator.wikimedia.org/T229884) (owner: Bstorm)
[21:25:01] !log restarting gerrit service to apply config change (528769)
[21:25:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:33:23] (PS2) Dzahn: extdist: Switch to gerrit-replica for cloning [puppet] - https://gerrit.wikimedia.org/r/528919 (owner: Legoktm)
[21:34:08] PROBLEM - puppet last run on cobalt is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 8 minutes ago with 1 failures.
Failed resources (up to 3 shown): Exec[git_pull_All-Avatars] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[21:39:42] RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[21:40:24] (PS7) Jeena Huneidi: Add restbase chart (port from local-charts) [deployment-charts] - https://gerrit.wikimedia.org/r/517557 (https://phabricator.wikimedia.org/T224935)
[21:42:28] (CR) Dzahn: [C: +2] extdist: Switch to gerrit-replica for cloning [puppet] - https://gerrit.wikimedia.org/r/528919 (owner: Legoktm)
[21:49:06] !log ppchelko@deploy1001 Started deploy [cpjobqueue/deploy@a151f4e]: Prepare for eventgate transition T230049 T230048
[21:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:49:15] T230048: Change-Prop partitioner fails with eventgate event - https://phabricator.wikimedia.org/T230048
[21:49:16] T230049: Delayed jobs fail validation in eventgate - https://phabricator.wikimedia.org/T230049
[21:50:05] !log ppchelko@deploy1001 Finished deploy [cpjobqueue/deploy@a151f4e]: Prepare for eventgate transition T230049 T230048 (duration: 00m 59s)
[21:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:51:47] (PS2) Bstorm: monitoring: Switch a collection of WMCS alerts to email-only [puppet] - https://gerrit.wikimedia.org/r/528905 (https://phabricator.wikimedia.org/T229884)
[21:52:03] (CR) Ori.livneh: "> Previously, the implicit mtime of the cache file was used, which is recorded *after* the input file is read" [mediawiki-config] - https://gerrit.wikimedia.org/r/528447 (https://phabricator.wikimedia.org/T217830) (owner: Krinkle)
[21:54:05] (PS3) Bstorm: monitoring: Switch a collection of WMCS alerts to email-only [puppet] - https://gerrit.wikimedia.org/r/528905 (https://phabricator.wikimedia.org/T229884)
[21:59:46] Operations,
Wikimedia-Site-requests, wikimediafoundation.org, Security: Setting up static maintenance page on Foundation servers for Foundation website - https://phabricator.wikimedia.org/T230075 (Varnent)
[21:59:57] Operations, Wikimedia-Site-requests, wikimediafoundation.org, Security: Setting up static maintenance page on Foundation servers for Foundation website - https://phabricator.wikimedia.org/T230075 (Varnent) p:Triage→Normal
[22:00:42] Operations, Wikimedia-Site-requests, wikimediafoundation.org, Security: Setting up static maintenance page on Foundation servers for Foundation website - https://phabricator.wikimedia.org/T230075 (Varnent)
[22:01:35] (PS3) Krinkle: CommonSettings: Store mtime inside wmf-config cache file [mediawiki-config] - https://gerrit.wikimedia.org/r/528447 (https://phabricator.wikimedia.org/T217830)
[22:01:49] (CR) Krinkle: "amended to answer Ori's questions :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/528447 (https://phabricator.wikimedia.org/T217830) (owner: Krinkle)
[22:02:48] Operations, Wikimedia-Site-requests, wikimediafoundation.org, Security: Setting up static maintenance page on Foundation servers for Foundation website - https://phabricator.wikimedia.org/T230075 (Varnent)
[22:03:25] (CR) Bstorm: [C: +2] monitoring: Switch a collection of WMCS alerts to email-only [puppet] - https://gerrit.wikimedia.org/r/528905 (https://phabricator.wikimedia.org/T229884) (owner: Bstorm)
[22:09:05] (CR) Krinkle: CommonSettings: Clean up wmf-config caching code [no-op] (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/528446 (https://phabricator.wikimedia.org/T217830) (owner: Krinkle)
[22:13:07] (PS1) EBernhardson: Send writes for all non-private wikis to cloudelastic [mediawiki-config] - https://gerrit.wikimedia.org/r/528961
[22:13:53] Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban): VMs on cloudvirt1015 crashing -
bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (RobH) I neglected to update this, but it passed all dell epsa tests without crash. If all we have is the log from T2208...
[22:14:00] Operations, wikimediafoundation.org, Security: Setting up static maintenance page on Foundation servers for Foundation website - https://phabricator.wikimedia.org/T230075 (Aklapper)
[22:16:10] (CR) jerkins-bot: [V: -1] Send writes for all non-private wikis to cloudelastic [mediawiki-config] - https://gerrit.wikimedia.org/r/528961 (owner: EBernhardson)
[22:17:58] Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (wiki_willy) a:RobH→Cmjohnson Moving back to @Cmjohnson - can you try getting Dell to RMA you a motherboard? If t...
[22:18:15] (PS2) EBernhardson: Send writes for all non-private wikis to cloudelastic [mediawiki-config] - https://gerrit.wikimedia.org/r/528961
[22:27:35] (CR) Subramanya Sastry: "Is it okay to get this swat deployed tomorrow then?" [mediawiki-config] - https://gerrit.wikimedia.org/r/528806 (https://phabricator.wikimedia.org/T229356) (owner: Subramanya Sastry)
[22:28:23] (PS1) Alaa Sarhan: Use global $wgThumbLimits as default for repo and client [mediawiki-config] - https://gerrit.wikimedia.org/r/528963
[22:29:33] (CR) Subramanya Sastry: "If this is reviewed tomorrow, can I get this swat deployed tomorrow?
That way, we can work on Parsoid changes and scandium puppet changes " [mediawiki-config] - https://gerrit.wikimedia.org/r/528591 (https://phabricator.wikimedia.org/T229354) (owner: Subramanya Sastry)
[22:47:03] (PS5) MarcoAurelio: mediawiki::maintenance::purge_checkuser.pp: Switch to PHP 7.2 [puppet] - https://gerrit.wikimedia.org/r/528730 (https://phabricator.wikimedia.org/T195392)
[22:48:00] Operations, ops-ulsfo: refresh/replace scs-ulsfo - https://phabricator.wikimedia.org/T230077 (RobH) p:Triage→Normal
[22:48:12] !log mwmaint1002 - manually running the purgeOldData cron command to verify it with PHP 7.2 for 528730 (T195392)
[22:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:48:26] T195392: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392
[22:50:16] !log mwmaint start cirrussearch saneitize.php against all non-private group1 wikis for cloudelastic cluster
[22:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:55:34] Operations, ops-ulsfo: refresh/replace scs-ulsfo - https://phabricator.wikimedia.org/T230077 (RobH)
[22:58:56] (PS1) Bstorm: labstore: restore original sense of the load alert with prometheus [puppet] - https://gerrit.wikimedia.org/r/528965 (https://phabricator.wikimedia.org/T229884)
[22:58:58] (CR) Dzahn: [C: +2] "tested by running the command manually as the same user and with PHP 7.2, no issues, worked normally" [puppet] - https://gerrit.wikimedia.org/r/528730 (https://phabricator.wikimedia.org/T195392) (owner: MarcoAurelio)
[23:00:04] MaxSem, RoanKattouw, and Niharika: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190807T2300).
[23:00:04] ebernhardson: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed.
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:02:10] i can ship it
[23:02:30] (CR) EBernhardson: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/528961 (owner: EBernhardson)
[23:03:26] !log set virtual-chassis vcp-snmp-statistics on asw-a-codfw - T228824
[23:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:03:33] T228824: Add VCP stats monitoring - https://phabricator.wikimedia.org/T228824
[23:03:38] (Merged) jenkins-bot: Send writes for all non-private wikis to cloudelastic [mediawiki-config] - https://gerrit.wikimedia.org/r/528961 (owner: EBernhardson)
[23:03:54] (CR) jenkins-bot: Send writes for all non-private wikis to cloudelastic [mediawiki-config] - https://gerrit.wikimedia.org/r/528961 (owner: EBernhardson)
[23:07:27] !log ebernhardson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T220625: Send writes for all non-private wikis to cloudelastic (duration: 01m 02s)
[23:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:07:37] T220625: Initialize CirrusSearch on cloudelastic - https://phabricator.wikimedia.org/T220625
[23:07:46] Operations, ops-eqiad, ops-ulsfo, DC-Ops: connect atlas-ulsfo to scs-ulsfo - https://phabricator.wikimedia.org/T206185 (RobH) Open→Stalled this is now blocked on the new scs setup and patch cables on T230077
[23:08:29] !log set virtual-chassis vcp-snmp-statistics on asw2-ulsfo - T228824
[23:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:08:37] T228824: Add VCP stats monitoring - https://phabricator.wikimedia.org/T228824
[23:15:50] PROBLEM - Host elastic2054 is DOWN: PING CRITICAL - Packet loss = 100%
[23:16:44] Operations, wikimediafoundation.org, Security: Setting up static maintenance page on Foundation servers for Foundation website - https://phabricator.wikimedia.org/T230075
(Reedy) >However, the idea came up that we should prepare a similar maintenance page on our servers for the site in case something...
[23:18:04] (CR) MarcoAurelio: "From a conversation with Dzahn on -releng. This might not be working as expected given that `foreachwiki` is still using `RUNNER=php`. Not" [puppet] - https://gerrit.wikimedia.org/r/528730 (https://phabricator.wikimedia.org/T195392) (owner: MarcoAurelio)
[23:22:38] !log elastic2054 - powercycling after it went down unexpectedly and Icinga alerted, this happened before in T227298
[23:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:22:47] T227298: elastic2054 unresponsive - https://phabricator.wikimedia.org/T227298
[23:23:02] Operations, netops: cr4-ulsfo rebooted unexpectedly - https://phabricator.wikimedia.org/T221156 (ayounsi) Open→Resolved >According to engineering there is no much information that can be provided from the crash as the issue thread do not have any information and is blank. >This is was not reprodu...
[23:24:48] RECOVERY - Host elastic2054 is UP: PING OK - Packet loss = 0%, RTA = 36.14 ms
[23:24:59] Operations, Discovery: elastic2054 unresponsive - https://phabricator.wikimedia.org/T227298 (Dzahn) Resolved→Open
[23:25:10] Operations, wikimediafoundation.org, Security: Setting up static maintenance page on Foundation servers for Foundation website - https://phabricator.wikimedia.org/T230075 (Varnent) Yes - to be fair - immediate outcome is not the expectation, more immediate action on our part. The idea originated with...
[23:26:28] Operations, Patch-For-Review: puppetdb queue size went up since July 30 - https://phabricator.wikimedia.org/T230002 (colewhite) p:Triage→Normal
[23:27:40] Operations, SRE-Access-Requests: Requesting access to Mwmaint1002, Stat1007 for Abijeet Patro - https://phabricator.wikimedia.org/T230020 (colewhite) p:Triage→Normal
[23:28:27] Operations, netops, observability: Add VCP stats monitoring - https://phabricator.wikimedia.org/T228824 (ayounsi) This is working! Why is that behind a configuration options and not enabled by default? I have no idea. Will let those two sit overnight and roll it to the whole fleet if all good.
[23:28:52] Operations, cloud-services-team (Kanban): Requesting access to Puppet for Viztor[S] - https://phabricator.wikimedia.org/T229894 (colewhite)
[23:29:28] Operations, Release-Engineering-Team, cloud-services-team (Kanban): Requesting access to Puppet for Viztor[S] - https://phabricator.wikimedia.org/T229894 (colewhite)
[23:35:59] (CR) Cwhite: [C: -1] add conniecc1 to analytics-(wmde|privatedata)-users,researchers group (1 comment) [puppet] - https://gerrit.wikimedia.org/r/525578 (https://phabricator.wikimedia.org/T228447) (owner: Herron)
[23:37:33] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003 and notebook1004] and groups for cchen - https://phabricator.wikimedia.org/T228447 (colewhite)
[23:37:39] Operations, Discovery: elastic2054 unresponsive - https://phabricator.wikimedia.org/T227298 (Dzahn) Open→Resolved in syslog there were no memory errors this time either. just stops and then continues. but in DRAC: ` ------------------------------------------------------------------------------...
[23:38:27] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003 and notebook1004] and groups for cchen - https://phabricator.wikimedia.org/T228447 (colewhite)
[23:40:13] Operations, ops-eqiad, ops-ulsfo, DC-Ops: connect atlas-ulsfo to scs-ulsfo - https://phabricator.wikimedia.org/T206185 (RobH)
[23:40:15] Operations, ops-ulsfo: refresh/replace scs-ulsfo - https://phabricator.wikimedia.org/T230077 (RobH)
[23:41:49] Operations, ops-ulsfo: refresh/replace scs-ulsfo - https://phabricator.wikimedia.org/T230077 (RobH)
[23:46:41] (PS4) Tim Starling: Add conditional loading of Parsoid/PHP as an extension [mediawiki-config] - https://gerrit.wikimedia.org/r/528591 (https://phabricator.wikimedia.org/T229354) (owner: Subramanya Sastry)
[23:49:44] (CR) Tim Starling: "PS4: fix phpcs style error" [mediawiki-config] - https://gerrit.wikimedia.org/r/528591 (https://phabricator.wikimedia.org/T229354) (owner: Subramanya Sastry)
[23:50:42] PROBLEM - Host elastic2054 is DOWN: PING CRITICAL - Packet loss = 100%
[23:50:49] Operations, Discovery: elastic2054 unresponsive - https://phabricator.wikimedia.org/T227298 (Papaul) My last comment on July 8 was "I swapped B2 with A2, no more error. leaving this task open for a week. If we do have the same problem on A2, I will request a replacement." It looks like we do have the err...
[23:51:12] Operations, Discovery: elastic2054 unresponsive - https://phabricator.wikimedia.org/T227298 (Papaul) Resolved→Open
[23:53:31] (CR) Tim Starling: [C: +2] Set up a multiversion rest.php endpoint [mediawiki-config] - https://gerrit.wikimedia.org/r/528806 (https://phabricator.wikimedia.org/T229356) (owner: Subramanya Sastry)
[23:53:49] (CR) Tim Starling: [C: +2] Add conditional loading of Parsoid/PHP as an extension [mediawiki-config] - https://gerrit.wikimedia.org/r/528591 (https://phabricator.wikimedia.org/T229354) (owner: Subramanya Sastry)
[23:54:20] (Merged) jenkins-bot: Set up a multiversion rest.php endpoint [mediawiki-config] - https://gerrit.wikimedia.org/r/528806 (https://phabricator.wikimedia.org/T229356) (owner: Subramanya Sastry)
[23:54:43] (Merged) jenkins-bot: Add conditional loading of Parsoid/PHP as an extension [mediawiki-config] - https://gerrit.wikimedia.org/r/528591 (https://phabricator.wikimedia.org/T229354) (owner: Subramanya Sastry)
[23:55:36] ACKNOWLEDGEMENT - Host elastic2054 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T227298
[23:56:23] (CR) jenkins-bot: Set up a multiversion rest.php endpoint [mediawiki-config] - https://gerrit.wikimedia.org/r/528806 (https://phabricator.wikimedia.org/T229356) (owner: Subramanya Sastry)
[23:58:08] !log tstarling@deploy1001 Synchronized w/rest.php: Creating rest.php endpoint disabled by default (duration: 00m 55s)
[23:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:58:33] ACKNOWLEDGEMENT - Host poolcounter2001 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T229998#5398688