[00:49:42] 10Operations, 10Anti-Harassment, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for tchanders, dmaza, dbarratt, wikigit - https://phabricator.wikimedia.org/T249059 (10Mooeypoo) Thank you everyone for all the feedback and back and forth, and I apologize for any m... [03:59:46] (03PS4) 10Andrew Bogott: cloud-vps hiera: introduce openstack_controllers and keystone_api_fqdn [puppet] - 10https://gerrit.wikimedia.org/r/589095 (https://phabricator.wikimedia.org/T249941) [04:00:27] (03PS11) 10Andrew Bogott: glance image_sync: use primary_glance_image_store to choose the image store [puppet] - 10https://gerrit.wikimedia.org/r/589096 (https://phabricator.wikimedia.org/T249941) [04:03:17] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps hiera: introduce openstack_controllers and keystone_api_fqdn [puppet] - 10https://gerrit.wikimedia.org/r/589095 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [04:07:56] (03CR) 10jerkins-bot: [V: 04-1] glance image_sync: use primary_glance_image_store to choose the image store [puppet] - 10https://gerrit.wikimedia.org/r/589096 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [04:27:11] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 269, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:29:07] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:36:29] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 271, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:36:31] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status 
[05:05:01] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:07:21] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:07:29] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:10:37] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:46:02] the interface down between cr2-codfw and eqiad is a Zayo transport, there is maintenance scheduled [05:49:52] (03CR) 10Giuseppe Lavagetto: [C: 03+1] profile::kubernetes: add the puppet CA cert to general.yaml (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/587799 (https://phabricator.wikimedia.org/T249633) (owner: 10Hnowlan) [05:57:14] (03CR) 10Giuseppe Lavagetto: [C: 03+1] maintenance: Migrate purge_checkuser to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589377 (https://phabricator.wikimedia.org/T211250) (owner: 10RLazarus) [06:02:38] (03CR) 10Elukey: [C: 03+2] Use -XX:NewRatio=3 for cloudelastic-chi instead of setting a specific size [puppet] - 10https://gerrit.wikimedia.org/r/589356 (https://phabricator.wikimedia.org/T231517) (owner: 10Elukey) [06:02:59] (03CR) 10Giuseppe Lavagetto: [C: 03+1] maintenance: Migrate purge_abusefilter to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589369 (https://phabricator.wikimedia.org/T211250) (owner: 10RLazarus) [06:04:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1092 after compression', diff saved to https://phabricator.wikimedia.org/P11001 and previous config saved to 
/var/cache/conftool/dbconfig/20200417-060419-marostegui.json [06:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:06] (03CR) 10Giuseppe Lavagetto: [C: 03+1] maintenance: Migrate purged_expired_userrights to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589378 (https://phabricator.wikimedia.org/T211250) (owner: 10RLazarus) [06:06:31] (03PS1) 10Marostegui: install_server: Do not reimage db1114 [puppet] - 10https://gerrit.wikimedia.org/r/589468 [06:10:23] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1114 [puppet] - 10https://gerrit.wikimedia.org/r/589468 (owner: 10Marostegui) [06:15:03] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 269, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:15:03] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:19:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1092 after compression', diff saved to https://phabricator.wikimedia.org/P11002 and previous config saved to /var/cache/conftool/dbconfig/20200417-061907-marostegui.json [06:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:34] (03CR) 10Elukey: Designate: replace standalone memcached with a mcrouter cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/588752 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [06:20:03] wow also cr1? 
[06:20:28] that should be Telia [06:20:55] yeah interface down [06:21:20] we have maintenance for both scheduled in gcal, just realized [06:22:28] we have the eqdfw and eqord paths to connect eqiad/codfw so not a big deal, but let's keep an eye on this [06:25:07] (03CR) 10Vgutierrez: "pcc looks almost like a NOOP on production: https://puppet-compiler.wmflabs.org/compiler1001/21999/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/589291 (owner: 10Alex Monk) [06:25:31] (03PS1) 10Ema: Add metric 'purged_udp_bytes_read_total' [software/purged] - 10https://gerrit.wikimedia.org/r/589470 [06:25:37] (03PS1) 10Ema: multicast: test URL extraction [software/purged] - 10https://gerrit.wikimedia.org/r/589471 [06:26:23] 10Operations, 10Anti-Harassment, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for tchanders, dmaza, dbarratt, wikigit - https://phabricator.wikimedia.org/T249059 (10Marostegui) The `vslow` solution for a few occasional queries also works for me [06:26:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1092 after compression', diff saved to https://phabricator.wikimedia.org/P11003 and previous config saved to /var/cache/conftool/dbconfig/20200417-062642-marostegui.json [06:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1111 from API', diff saved to https://phabricator.wikimedia.org/P11004 and previous config saved to /var/cache/conftool/dbconfig/20200417-063038-marostegui.json [06:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:31:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1092 after compression', diff saved to https://phabricator.wikimedia.org/P11005 and previous config saved to /var/cache/conftool/dbconfig/20200417-063138-marostegui.json [06:31:43] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:40] (03PS1) 10Elukey: elasticsearch::instance: avoid UseCMSInitiatingOccupancyOnly [puppet] - 10https://gerrit.wikimedia.org/r/589472 (https://phabricator.wikimedia.org/T231517) [06:35:31] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 271, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:39:13] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:40:03] (03PS1) 10Vgutierrez: ATS: Track current client|server transactions [puppet] - 10https://gerrit.wikimedia.org/r/589473 (https://phabricator.wikimedia.org/T249335) [06:58:05] 10Operations, 10Research: recommendation api's test on scb nodes are flapping - https://phabricator.wikimedia.org/T247732 (10elukey) @bmansurov thanks for following up! What I'd start doing is to log 50x errors from the MW api in the service logs if possible, so people can easily get what is happening when the... [07:00:40] 10Operations, 10observability: production-logstash elastic cluster is yellow state - https://phabricator.wikimedia.org/T250133 (10elukey) ES cluster green! Great work everybody :) [07:15:41] 10Operations, 10Core Platform Team, 10MediaWiki-General, 10serviceops, 10Wikimedia-Incident: Revisit timeouts, concurrency limits in remote HTTP calls from MediaWiki - https://phabricator.wikimedia.org/T245170 (10Joe) >>! In T245170#5916911, @WDoranWMF wrote: > Moving this to feature requests for PMs to... 
[07:22:02] (03PS2) 10Ema: multicast: test URL extraction [software/purged] - 10https://gerrit.wikimedia.org/r/589471 [07:23:31] (03CR) 10Ema: [C: 03+1] ATS: Track current client|server transactions [puppet] - 10https://gerrit.wikimedia.org/r/589473 (https://phabricator.wikimedia.org/T249335) (owner: 10Vgutierrez) [07:25:26] (03PS2) 10Ema: Add metric 'purged_udp_bytes_read_total' [software/purged] - 10https://gerrit.wikimedia.org/r/589470 [07:25:28] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Implement logic to be able to perform full and incremental backups of ES hosts - https://phabricator.wikimedia.org/T244884 (10jcrespo) 05Open→03Stalled [07:25:31] (03PS3) 10Ema: multicast: test URL extraction [software/purged] - 10https://gerrit.wikimedia.org/r/589471 [07:25:33] 10Operations, 10DBA, 10Goal: Set up backup strategy for es clusters - https://phabricator.wikimedia.org/T79922 (10jcrespo) [07:25:49] 10Operations, 10DBA, 10Patch-For-Review: Implement logic to be able to perform full and incremental backups of ES hosts - https://phabricator.wikimedia.org/T244884 (10jcrespo) [07:26:16] (03PS1) 10Jcrespo: mariadb-backups: Skip generation of backups for read-only es sections [puppet] - 10https://gerrit.wikimedia.org/r/589534 (https://phabricator.wikimedia.org/T79922) [07:26:40] (03PS2) 10Vgutierrez: ATS: Track current client and server transactions [puppet] - 10https://gerrit.wikimedia.org/r/589473 (https://phabricator.wikimedia.org/T249335) [07:26:50] 10Operations, 10Anti-Harassment, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for tchanders, dmaza, dbarratt, wikigit - https://phabricator.wikimedia.org/T249059 (10fgiunchedi) 05Stalled→03Resolved Sounds like we have consensus! 
Thank you to all involved [07:29:17] (03CR) 10Dzahn: [C: 03+2] add contint.wikimedia.org service alias for contint machines [dns] - 10https://gerrit.wikimedia.org/r/589285 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [07:29:22] (03PS3) 10Dzahn: add contint.wikimedia.org service alias for contint machines [dns] - 10https://gerrit.wikimedia.org/r/589285 (https://phabricator.wikimedia.org/T210411) [07:29:48] (03CR) 10Dzahn: [C: 03+2] "per brief chat with traffic" [dns] - 10https://gerrit.wikimedia.org/r/589285 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [07:34:20] Now that its task has been closed, shouldn’t https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/585458/ be abandoned? [07:35:20] RhinosF1: looks like it, yes [07:35:36] jynus: mind if we abandon your change? ^ seems not needed anymore [07:36:22] (03CR) 10Dzahn: "related ticket is closed as resolved and this doesn't seem to be needed anymore. can probably be abandoned." [puppet] - 10https://gerrit.wikimedia.org/r/585458 (https://phabricator.wikimedia.org/T249059) (owner: 10Jcrespo) [07:36:49] Thanks dzahn [07:37:02] 10Operations, 10LDAP-Access-Requests: LDAP/NDA Access Request for mshaver - https://phabricator.wikimedia.org/T250430 (10fgiunchedi) Hi @MNoorWMF, I see from your Phabricator profile that your username used to be mnoor? We have mnoor in the correct group but not mshaver. I take it mnoor is no longer in use? 
[07:37:18] 10Operations, 10LDAP-Access-Requests: LDAP/NDA Access Request for mshaver - https://phabricator.wikimedia.org/T250430 (10fgiunchedi) p:05Triage→03Medium [07:37:54] (03CR) 10Marostegui: [C: 03+1] mariadb-backups: Skip generation of backups for read-only es sections [puppet] - 10https://gerrit.wikimedia.org/r/589534 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [07:38:24] (03PS1) 10JMeybohm: Update .ruby-version to what is running in production [puppet] - 10https://gerrit.wikimedia.org/r/589539 [07:38:48] (03PS1) 10Ema: cache: use purged on cache_text [puppet] - 10https://gerrit.wikimedia.org/r/589540 (https://phabricator.wikimedia.org/T249325) [07:39:47] (03CR) 10Vgutierrez: [C: 03+2] ATS: Track current client and server transactions [puppet] - 10https://gerrit.wikimedia.org/r/589473 (https://phabricator.wikimedia.org/T249335) (owner: 10Vgutierrez) [07:39:51] (03PS2) 10Ema: cache: use purged on cache_text [puppet] - 10https://gerrit.wikimedia.org/r/589540 (https://phabricator.wikimedia.org/T249325) [07:39:51] mutante: sure [07:41:09] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-wmde-users for awight - https://phabricator.wikimedia.org/T250364 (10fgiunchedi) p:05Triage→03Medium [07:41:17] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-wmde-users for Christoph Jauera - https://phabricator.wikimedia.org/T250362 (10fgiunchedi) p:05Triage→03Medium [07:41:26] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10Services, 10Service-deployment-requests: New Service Request: Wikimedia push notification service - https://phabricator.wikimedia.org/T250452 (10fgiunchedi) p:05Triage→03Medium [07:41:40] (03PS2) 10Dzahn: ATS: use contint service alias as backend for integration.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/588973 (https://phabricator.wikimedia.org/T224591) [07:41:49] 10Operations, 10DBA: replace phabricator db passwords with longer passwords - 
https://phabricator.wikimedia.org/T250361 (10fgiunchedi) p:05Triage→03Medium [07:42:13] jynus: ack, done [07:42:19] (03Abandoned) 10Dzahn: admin: Add analytics private group to tchanders, dmaza, dbarratt, wikigit [puppet] - 10https://gerrit.wikimedia.org/r/585458 (https://phabricator.wikimedia.org/T249059) (owner: 10Jcrespo) [07:44:22] (03CR) 10Ema: "pcc looks good: https://puppet-compiler.wmflabs.org/compiler1001/22000/" [puppet] - 10https://gerrit.wikimedia.org/r/589540 (https://phabricator.wikimedia.org/T249325) (owner: 10Ema) [07:45:22] (03PS1) 10Giuseppe Lavagetto: mediawiki::php::admin: allow inspecting ini values [puppet] - 10https://gerrit.wikimedia.org/r/589541 [07:45:24] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Skip generation of backups for read-only es sections [puppet] - 10https://gerrit.wikimedia.org/r/589534 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [07:45:25] !log restart wdqs-updater on all nodes after deployment [07:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:10] (03PS3) 10Ema: cache: use purged on cache_text [puppet] - 10https://gerrit.wikimedia.org/r/589540 (https://phabricator.wikimedia.org/T249325) [07:50:40] (03CR) 10Ema: [C: 03+2] cache: use purged on cache_text [puppet] - 10https://gerrit.wikimedia.org/r/589540 (https://phabricator.wikimedia.org/T249325) (owner: 10Ema) [07:53:32] 10Operations, 10Wikimedia-Mailing-lists: Mailing list request for WPWP (Wikipedia Pages Wanting Photos) - https://phabricator.wikimedia.org/T250390 (10fgiunchedi) p:05Triage→03Medium Mailing list has been created, see also https://lists.wikimedia.org/mailman/admin/wpwp. The initial admin password has been... 
[07:54:01] !log cache_text: puppet run to stop vhtcpd and start purged T249325 [07:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:08] T249325: cache_text cluster consistently backlogged on purge requests - https://phabricator.wikimedia.org/T249325 [07:54:41] 10Operations, 10Wikimedia-Mailing-lists: Mailing list request for WPWP (Wikipedia Pages Wanting Photos) - https://phabricator.wikimedia.org/T250390 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Tentatively resolving, feel free to reopen if needed [08:08:32] (03PS1) 10Elukey: statistics::rsync::mediawiki: reduce retention and improve security [puppet] - 10https://gerrit.wikimedia.org/r/589542 (https://phabricator.wikimedia.org/T249754) [08:10:00] (03CR) 10Muehlenhoff: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/588973 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [08:10:35] <_joe_> ema: won't you need to restart the caches? you're losing some millions of purges AIUI, right? 
[08:11:02] <_joe_> I mean, they're so backlogged that they're all stale by now, I guess [08:12:05] (03PS1) 10Dzahn: delete unused rt.wikimedia.org.key [labs/private] - 10https://gerrit.wikimedia.org/r/589543 [08:12:55] (03PS1) 10Dzahn: add fake contint.wikimedia.org key [labs/private] - 10https://gerrit.wikimedia.org/r/589544 [08:13:24] (03CR) 10Dzahn: [V: 03+2 C: 03+2] delete unused rt.wikimedia.org.key [labs/private] - 10https://gerrit.wikimedia.org/r/589543 (owner: 10Dzahn) [08:13:38] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/22002/stat1007.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/589542 (https://phabricator.wikimedia.org/T249754) (owner: 10Elukey) [08:13:51] _joe_: I don't think DCs other than esams are that backlogged (and esams has been running purged for 24h now, incidentally also the ttl_cap which we're now enforcing at the ats-be layer too) [08:13:57] so we should be alright [08:14:06] (03CR) 10Dzahn: [C: 03+2] "just created this new cert in the private repo" [labs/private] - 10https://gerrit.wikimedia.org/r/589544 (owner: 10Dzahn) [08:14:09] <_joe_> oh [08:14:16] <_joe_> we're enforcing the TTL cap already? [08:14:18] <_joe_> great [08:14:33] yup! 
found yesterday that there is indeed a knob in ats for that [08:14:56] (03CR) 10Dzahn: [V: 03+2 C: 03+2] add fake contint.wikimedia.org key [labs/private] - 10https://gerrit.wikimedia.org/r/589544 (owner: 10Dzahn) [08:15:04] !log dropping wikidatawiki.wb_items_per_site_old table in codfw T250345 [08:15:06] <_joe_> let's keep an eye on the cache hit ratios [08:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:11] T250345: Drop wikidatawiki.wb_items_per_site_old from s8 hosts - https://phabricator.wikimedia.org/T250345 [08:16:07] _joe_: yup, https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?orgId=1&refresh=15m is useful to stare at [08:16:11] (03PS3) 10Dzahn: ATS: use contint service alias as backend for integration.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/588973 (https://phabricator.wikimedia.org/T224591) [08:17:10] ^^ mutante no TLS available yet on contint? [08:17:32] 10Operations, 10Traffic, 10Goal, 10Performance-Team (Radar): Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (10Gilles) >>! In T170567#6062022, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.org/log/g-SWgnEBv7KcG9M-vpmM} [2020-04-16T... [08:17:34] vgutierrez: coming soon. i just made the cert for that and next https://gerrit.wikimedia.org/r/c/operations/puppet/+/588980 [08:17:47] vgutierrez: i just did not want to keep adding stuff to the buster migration .. but now doing both [08:18:09] it's one of the last remaining ones on the "applayer without TLS" tickets [08:18:23] 10Operations, 10Traffic, 10Patch-For-Review: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 (10Gilles) >>! In T249335#6062023, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.org/log/EfuXgnEBj_Bg1xd3h9Ty} [2020-04-16T10:44... 
[08:19:22] creating the contint.wikimedia.org cname is also part of that to have it as a SAN on the cert [08:20:19] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:20:29] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:26:12] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/589400 (owner: 10Ottomata) [08:26:14] (03CR) 10Dzahn: [C: 03+2] ATS: use contint service alias as backend for integration.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/588973 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [08:26:23] (03CR) 10Dzahn: [C: 03+2] "no difference with curl from cp1075" [puppet] - 10https://gerrit.wikimedia.org/r/588973 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [08:27:56] 10Operations, 10ORES, 10Scoring-platform-team (Current): ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart) - https://phabricator.wikimedia.org/T242705 (10Gilles) Do you think this is Flask-specific? How hard would it be to port this code to another Python micro WS... 
[08:28:57] (03PS1) 10Elukey: role::statistics: share mediawiki/eventlogging profiles [puppet] - 10https://gerrit.wikimedia.org/r/589549 (https://phabricator.wikimedia.org/T249754) [08:30:11] (03PS2) 10Elukey: role::statistics: share mediawiki/eventlogging profiles [puppet] - 10https://gerrit.wikimedia.org/r/589549 (https://phabricator.wikimedia.org/T249754) [08:31:57] (03PS2) 10Dzahn: ci::master: add envoy for TLS termination for integration [puppet] - 10https://gerrit.wikimedia.org/r/588980 (https://phabricator.wikimedia.org/T210411) [08:34:00] (03PS3) 10Dzahn: ci::master: add envoy for TLS termination for integration [puppet] - 10https://gerrit.wikimedia.org/r/588980 (https://phabricator.wikimedia.org/T210411) [08:35:45] (03CR) 10Dzahn: [C: 04-1] "argg.. yes..the jessie limitation. Envoy can only work with unprivileged ports under jessie" [puppet] - 10https://gerrit.wikimedia.org/r/588980 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [08:36:00] (03CR) 10jerkins-bot: [V: 04-1] role::statistics: share mediawiki/eventlogging profiles [puppet] - 10https://gerrit.wikimedia.org/r/589549 (https://phabricator.wikimedia.org/T249754) (owner: 10Elukey) [08:36:13] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Update .ruby-version to what is running in production [puppet] - 10https://gerrit.wikimedia.org/r/589539 (owner: 10JMeybohm) [08:37:05] (03CR) 10Giuseppe Lavagetto: [C: 03+1] maintenance: Migrate purge_securepoll to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589384 (https://phabricator.wikimedia.org/T211250) (owner: 10RLazarus) [08:39:09] (03CR) 10Giuseppe Lavagetto: [C: 03+1] maintenance: Migrate purge_old_cx_drafts to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589379 (https://phabricator.wikimedia.org/T211250) (owner: 10RLazarus) [08:39:37] (03PS3) 10Elukey: role::statistics: share mediawiki/eventlogging profiles [puppet] - 10https://gerrit.wikimedia.org/r/589549 (https://phabricator.wikimedia.org/T249754) [08:41:49] (03CR) 
10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/22007/" [puppet] - 10https://gerrit.wikimedia.org/r/589549 (https://phabricator.wikimedia.org/T249754) (owner: 10Elukey) [08:42:19] (03PS4) 10Dzahn: ci::master: add envoy for TLS termination for integration [puppet] - 10https://gerrit.wikimedia.org/r/588980 (https://phabricator.wikimedia.org/T210411) [08:42:23] 10Operations, 10DC-Ops, 10SRE-Access-Requests: access request on cumin[1-2]001 for John Clark - https://phabricator.wikimedia.org/T249916 (10MoritzMuehlenhoff) >>! In T249916#6056426, @MoritzMuehlenhoff wrote: >> ** Access to the mgmt IP network remotely. Right now that's firewalled to the cumin hosts, acces... [08:43:19] (03CR) 10JMeybohm: [C: 03+2] Update .ruby-version to what is running in production [puppet] - 10https://gerrit.wikimedia.org/r/589539 (owner: 10JMeybohm) [08:46:26] (03CR) 10Elukey: [C: 03+2] role::statistics: share mediawiki/eventlogging profiles [puppet] - 10https://gerrit.wikimedia.org/r/589549 (https://phabricator.wikimedia.org/T249754) (owner: 10Elukey) [08:47:52] 10Operations, 10homer, 10netops, 10Patch-For-Review: Homer: Netbox driven switch interfaces - https://phabricator.wikimedia.org/T250429 (10ayounsi) [08:47:54] 10Operations: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367 (10ayounsi) [08:50:03] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:55:31] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:00:01] !log dropping wikidatawiki.wb_items_per_site_old table in eqiad (non-labs hosts) T250345 [09:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:07] T250345: Drop wikidatawiki.wb_items_per_site_old from s8 hosts - https://phabricator.wikimedia.org/T250345 [09:00:24] (03PS1) 10Vgutierrez: ATS: Disable KA on cp1077 [puppet] - 10https://gerrit.wikimedia.org/r/589551 (https://phabricator.wikimedia.org/T248938) [09:00:26] 10Operations, 10SRE-Access-Requests: Requesting access to analytics for andrew-wmde - https://phabricator.wikimedia.org/T249733 (10Andrew-WMDE) @fgiunchedi Thank you, everything appears to be working! [09:01:05] (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/22008/contint1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/588980 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [09:03:03] (03CR) 10Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1003/22009/" [puppet] - 10https://gerrit.wikimedia.org/r/589551 (https://phabricator.wikimedia.org/T248938) (owner: 10Vgutierrez) [09:04:55] (03CR) 10Ema: [C: 03+1] ATS: Disable KA on cp1077 [puppet] - 10https://gerrit.wikimedia.org/r/589551 (https://phabricator.wikimedia.org/T248938) (owner: 10Vgutierrez) [09:05:50] (03CR) 10Vgutierrez: [C: 03+2] ATS: Disable KA on cp1077 [puppet] - 10https://gerrit.wikimedia.org/r/589551 (https://phabricator.wikimedia.org/T248938) (owner: 10Vgutierrez) [09:07:39] !log disable KA between ats-tls and varnish-fe on cp1077 - T248938 [09:07:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:46] T248938: ATS ts_lua coredumps on config reload - https://phabricator.wikimedia.org/T248938 [09:07:58] *sigh* [09:08:00] wrong task number [09:11:51] 10Operations, 10Wikimedia-Mailing-lists: Mailing list request for WPWP (Wikipedia Pages Wanting Photos) - 
https://phabricator.wikimedia.org/T250390 (10Wikicology) Thank you Fgiunchedi. I have received the notification. Regards. [09:12:15] vgutierrez: you can edit SAL in wiki [09:12:22] (already done, thanks) [09:13:28] (03CR) 10Dzahn: [C: 03+2] ci::master: add envoy for TLS termination for integration [puppet] - 10https://gerrit.wikimedia.org/r/588980 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [09:14:06] 10Operations, 10SRE-Access-Requests: Requesting access to analytics for andrew-wmde - https://phabricator.wikimedia.org/T249733 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Fantastic, thank you @Andrew-WMDE. Resolving [09:15:40] vgutierrez: ^ adding envoy right now on the backends. will probably do the switch to https on Monday [09:15:50] mutante: ack <3 [09:16:28] (03PS1) 10Elukey: role::statistics::explorer: add profiles to match role::statistics::private [puppet] - 10https://gerrit.wikimedia.org/r/589553 (https://phabricator.wikimedia.org/T249754) [09:17:25] aww. except E: Unable to locate package getenvoy-envoy [09:17:33] (03CR) 10Elukey: "Gilles/David/Erik: if you like the idea, after this patch you'll likely have to add more targets to the scap config of your repos!" 
[puppet] - 10https://gerrit.wikimedia.org/r/589553 (https://phabricator.wikimedia.org/T249754) (owner: 10Elukey) [09:18:02] this is jessie but in the past on OTRS we were able to use it [09:18:35] the requirement for the additional package is new i think [09:20:45] !log imported helm 2.12.2 to main for buster-wikimedia [09:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:52] (03PS1) 10JMeybohm: admin: Update dotfiles for jayme [puppet] - 10https://gerrit.wikimedia.org/r/589554 [09:28:06] jayme: :)) very nice, that will be great for contint2001 [09:29:33] (03CR) 10JMeybohm: [C: 03+2] admin: Update dotfiles for jayme [puppet] - 10https://gerrit.wikimedia.org/r/589554 (owner: 10JMeybohm) [09:31:04] mutante: helm-diff is still missing though, looking into that now [09:31:15] jayme: ack, thank you! [09:33:06] (03PS1) 10Dzahn: add certificate for contint/integration.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/589556 (https://phabricator.wikimedia.org/T210411) [09:45:59] (03PS1) 10Jbond: profile::idp: provide undef default value for Optional params [puppet] - 10https://gerrit.wikimedia.org/r/589559 [09:49:48] (03PS3) 10Ema: vcl: introduce wm_admission_policies [puppet] - 10https://gerrit.wikimedia.org/r/588945 (https://phabricator.wikimedia.org/T249809) [09:49:50] (03PS2) 10Ema: vcl: move 'exp' admission policy to wm_admission_policies [puppet] - 10https://gerrit.wikimedia.org/r/589341 (https://phabricator.wikimedia.org/T249809) [09:49:52] (03PS4) 10Ema: vcl: 10M cutoff for the 'exp' admission policy [puppet] - 10https://gerrit.wikimedia.org/r/589342 (https://phabricator.wikimedia.org/T249809) [09:50:21] (03CR) 10Jbond: [C: 03+2] profile::idp: provide undef default value for Optional params [puppet] - 10https://gerrit.wikimedia.org/r/589559 (owner: 10Jbond) [09:51:01] (03CR) 10Dzahn: [C: 03+2] add certificate for contint/integration.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/589556 
(https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [09:54:01] !log enabling replication from pc1007 to pc1010 T247787 [09:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:07] T247787: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 [09:55:47] 10Operations, 10Wikimedia-Mailing-lists: Request: Private mailing list - https://phabricator.wikimedia.org/T250470 (10Bugreporter) [09:55:57] 10Operations, 10Wikimedia-Mailing-lists: Request: Private mailing list - https://phabricator.wikimedia.org/T250470 (10Wikicology) [09:57:01] (03PS5) 10Ema: vcl: 10M cutoff for the 'exp' admission policy [puppet] - 10https://gerrit.wikimedia.org/r/589342 (https://phabricator.wikimedia.org/T249809) [09:57:44] 10Operations, 10Wikimedia-Mailing-lists: Request private mailing list wpwp-organizers@ - https://phabricator.wikimedia.org/T250470 (10Majavah) [10:00:41] PROBLEM - Check systemd state on notebook1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:02:01] buuuu [10:02:54] ah it is a notebook unit, will reset-failed it [10:03:00] 10Operations: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367 (10ayounsi) So for hosts that say something like: `Port description : NIC 10Gb SFP+ DA` one theory is that the NIC has an embedded LLDP daemon that prevents the host one to work properly. There are some flags mention... [10:03:20] elukey: we should put more peer pressure on icinga by regularly insulting it and hoping it will then think twice before alerting in public like this [10:03:25] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:03:35] ema: I know! 
[10:04:36] (03PS1) 10ArielGlenn: add machine vision tables dump to the web page for 'other' dumps [puppet] - 10https://gerrit.wikimedia.org/r/589561 (https://phabricator.wikimedia.org/T236431) [10:07:25] !log change pc2010 to replicate from pc1010 T247787 [10:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:31] T247787: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 [10:07:31] (03CR) 10ArielGlenn: [C: 03+2] add machine vision tables dump to the web page for 'other' dumps [puppet] - 10https://gerrit.wikimedia.org/r/589561 (https://phabricator.wikimedia.org/T236431) (owner: 10ArielGlenn) [10:07:49] PROBLEM - Check systemd state on notebook1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:07:55] <_joe_> lol [10:07:58] <_joe_> elukey: ^^ [10:08:15] PROBLEM - Check systemd state on contint1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:08:29] (03CR) 10Ema: "Rendered VCL looks reasonable to me: https://puppet-compiler.wmflabs.org/compiler1001/22011/cp3050.esams.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/589342 (https://phabricator.wikimedia.org/T249809) (owner: 10Ema) [10:08:31] PROBLEM - Check no envoy runtime configuration is left persistent on contint1001 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:08:40] <_joe_> mutante: ^^ [10:08:55] PROBLEM - Check that envoy is running on contint1001 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is failed https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:09:59] _joe_: arrg, thanks. 
on it now [10:12:59] (03PS1) 10Hnowlan: changeprop: Use staging eventgate in staging environment. [deployment-charts] - 10https://gerrit.wikimedia.org/r/589562 (https://phabricator.wikimedia.org/T249739) [10:13:36] 10Operations, 10DBA, 10Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (10Kormat) Magic happened: - Optimized tables on pc1010 to free up some space (it halved disk usage) - Moved pc1010 to replicate under pc1007 -... [10:16:36] <_joe_> !log contint1001:~$ sudo /usr/local/sbin/build-envoy-config -c /etc/envoy [10:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:10] <_joe_> !log contint1001:~$ sudo systemctl restart envoyproxy.service [10:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:40] _joe_: ah, i was kind of expecting that but my config did exist in listeners.d [10:17:47] <_joe_> yes [10:17:52] RECOVERY - Check systemd state on contint1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:17:53] RECOVERY - Check that envoy is running on contint1001 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:17:57] <_joe_> but that needs to be "compiled" in envoy.yaml [10:18:04] <_joe_> and that happens via build-envoy-config [10:18:11] <_joe_> not sure why it didn't run [10:18:39] ack, thanks for the fix [10:19:37] using 1443 because it's jessie. 
but once we upgraded to buster i can change it to 443 [10:20:56] RECOVERY - Check no envoy runtime configuration is left persistent on contint1001 is OK: HTTP OK: HTTP/1.1 200 OK - 306 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:21:34] rescheduled the check for that, all green [10:27:10] 10Operations: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367 (10ayounsi) [10:28:40] 10Operations, 10Wikimedia-Mailing-lists: Request private mailing list wpwp-organizers@ - https://phabricator.wikimedia.org/T250470 (10Aklapper) Hi @Wikicology. In the future, please follow https://meta.wikimedia.org/wiki/Mailing_lists#Create_a_new_list for required information. Thanks! :) [10:31:20] PROBLEM - Check no envoy runtime configuration is left persistent on idp-test2001 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:31:46] PROBLEM - Check that envoy is running on idp-test2001 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is inactive https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:32:14] 10Operations, 10MediaWiki-Cache, 10Page Content Service, 10Product-Infrastructure-Team-Backlog, and 3 others: cache_text cluster consistently backlogged on purge requests - https://phabricator.wikimedia.org/T249325 (10Urbanecm) >>! In T249325#6063043, @ema wrote: > @Urbanecm, @AntiCompositeNumber: esams h... 
[10:37:14] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:41:02] (03PS1) 10Dzahn: ATS: switch contint backend to use TLS [puppet] - 10https://gerrit.wikimedia.org/r/589565 (https://phabricator.wikimedia.org/T210411) [10:43:12] 10Operations, 10Wikimedia-Mailing-lists: Request private mailing list wpwp-organizers@ - https://phabricator.wikimedia.org/T250470 (10Wikicology) @Aklapper, noted with thanks. Regards. [10:46:40] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/589320 (https://phabricator.wikimedia.org/T230743) (owner: 10Bearloga) [10:47:36] 10Operations, 10Wikimedia-Mailing-lists: Request private mailing list wpwp-organizers@ - https://phabricator.wikimedia.org/T250470 (10Dzahn) You have successfully created the mailing list wpwp-organizers and notification has been sent to the list owner reachout2isaac@gmail.com. You can now: [[ https://lists.w... [10:48:24] (03CR) 10Jbond: [C: 03+2] ferm: Add status check [puppet] - 10https://gerrit.wikimedia.org/r/576101 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [10:50:54] 10Operations, 10Wikimedia-Mailing-lists: Request private mailing list wpwp-organizers@ - https://phabricator.wikimedia.org/T250470 (10Dzahn) 05Open→03Resolved a:03Dzahn @Wikicology Here you go, you should have received mail with a random password. You can login with that on the admin page above. I adde... [11:00:11] (03PS3) 10Dzahn: merge microsites into webserver_misc_apps [puppet] - 10https://gerrit.wikimedia.org/r/587985 (https://phabricator.wikimedia.org/T247650) [11:00:18] (03PS13) 10Hnowlan: profile::kubernetes: add the puppet CA cert to general.yaml [puppet] - 10https://gerrit.wikimedia.org/r/587799 (https://phabricator.wikimedia.org/T249633) [11:05:13] cumin alias: 'DC aliases do not cover all hosts: mw1253.eqiad.wmnet' .weird. that's one of hosts i decom'ed. 
but only this one and not the other ones done at the same time. and used cookbook [11:05:34] looking for any remnants [11:06:35] (03CR) 10Hnowlan: [C: 03+2] profile::kubernetes: add the puppet CA cert to general.yaml [puppet] - 10https://gerrit.wikimedia.org/r/587799 (https://phabricator.wikimedia.org/T249633) (owner: 10Hnowlan) [11:08:59] 10Operations, 10DBA, 10Goal: Set up backup strategy for es clusters - https://phabricator.wikimedia.org/T79922 (10jcrespo) Proposed backup workflow strategy for content and metadata database backups: {F31761425} [11:09:03] mutante: it's not in puppetdb [11:10:02] volans: hm. but isn't that where "all_hosts" comes from? [11:10:08] and then it does all_hosts - hosts ? [11:11:00] I don't see it in site.pp [11:11:11] is that host still alive/ in prod? was it decommes? [11:11:12] yea, that's expected [11:11:13] *decommed [11:11:39] 10Operations, 10DBA, 10Goal: Set up backup strategy for es clusters - https://phabricator.wikimedia.org/T79922 (10Marostegui) Does that mean we won't be storing metadata/content backup within the same DC and only on the remote DC? [11:12:12] fully decom'ed together with mw1250 - mw1252 which don't show up here [11:13:11] volans: it comes from cumin2001 not cumin1001 if that can be a difference [11:13:22] do you have the decom task? [11:13:27] nah it shows up on both [11:13:39] 11:39 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [11:13:42] 11:37 mutante: decom mw1250 - mw1253 [11:14:48] might be a race condition [11:14:50] curl -Gs https://puppetdb1002.eqiad.wmnet/pdb/query/v4/nodes/mw1253.eqiad.wmnet [11:15:04] still has a partial [11:15:16] weird [11:16:08] volans: https://phabricator.wikimedia.org/T247780#5997935 [11:17:00] should i just run the cookbook another time and we see if it's gone? 
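The "DC aliases do not cover all hosts" mail discussed above comes down to a set difference ("all_hosts - hosts"): every known host should fall under at least one datacenter alias. A minimal sketch of that idea follows. The function name, the alias layout and the sample host lists are hypothetical, and the real cumin alias check is not shown in this log and may work differently.

```python
def uncovered_hosts(all_hosts, dc_aliases):
    """Return the hosts in all_hosts that no datacenter alias covers.

    all_hosts:  iterable of hostnames (in production these would come from PuppetDB).
    dc_aliases: mapping of alias name -> iterable of hostnames (hypothetical layout).
    """
    covered = set()
    for hosts in dc_aliases.values():
        covered.update(hosts)
    # Anything left over is known to the inventory but matched by no alias.
    return sorted(set(all_hosts) - covered)


# Illustrative data only: mw1253 lingers in the host list but matches no alias.
all_hosts = ["mw1254.eqiad.wmnet", "mw2001.codfw.wmnet", "mw1253.eqiad.wmnet"]
dc_aliases = {
    "eqiad": ["mw1254.eqiad.wmnet"],
    "codfw": ["mw2001.codfw.wmnet"],
}
print(uncovered_hosts(all_hosts, dc_aliases))
```

A non-empty result is what would trigger the daily mail; a host that was only partially re-created in the inventory shows up here because it carries none of the data the aliases select on.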
[11:17:07] let me debug a moment [11:17:10] ok [11:17:16] the host is gone so harder :D [11:18:48] mutante: ah that's from march 25th [11:18:52] do you have IRC highlighting for the word "cumin" ?:) [11:18:58] most logs have been rotated [11:19:02] ah [11:19:17] that's a secret ;) [11:19:45] yea, it's been a while. and the alias-check mail is being sent more often than that. so why did it not pop up earlier [11:19:47] for the sake of a test try to re-run the cookbook, I'm not sure it would work, let's see [11:19:48] heheee [11:19:55] ok [11:20:03] if not we just issue the puppet removal [11:20:40] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [11:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:53] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [11:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:00] 10Operations, 10serviceops, 10Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1253.eqiad.wmnet` - mw1253.eqiad.wmnet (**FAIL**) - Host steps raised exception: Empty... [11:21:36] no, it's still there [11:21:42] empty mgmt passw [11:21:46] it didn't run at all [11:21:46] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [11:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:15] Sleeping for 20s to avoid race conditions... 
[11:22:19] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [11:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:25] 10Operations, 10serviceops, 10Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1253.eqiad.wmnet` - mw1253.eqiad.wmnet (**FAIL**) - Failed downtime host on Icinga (li... [11:22:34] Removed from Puppet master and PuppetDB [11:22:39] but curl still shows it [11:23:10] !log dzahn@cumin2001 START - Cookbook sre.hosts.decommission [11:23:10] !log dzahn@cumin2001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) [11:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:40] volans: is it expected to work the same on cumin2001 as on cumin1001 ? [11:23:48] yes [11:24:01] }[cumin2001:~] $ sudo cookbook sre.hosts.decommission mw1253.eqiad.wmnet -t T247780 [11:24:02] T247780: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 [11:24:07] spicerack.remote.RemoteError: No hosts provided [11:24:12] ^ this is different there [11:25:56] interesting [11:25:56] auth.log:119411:Apr 15 09:47:38 puppetmaster1001 sudo: jbond : TTY=pts/5 ; PWD=/home/jbond ; USER=root ; COMMAND=/usr/bin/puppet lookup ldap --node mw1253.eqiad.wmnet --explain --compile [11:26:27] maybe it's that [11:26:35] whats the matter? [11:26:52] hey John, so mw1253 was decommed [11:26:59] now it partially reappeard in puppetdb [11:27:09] try: curl -Gs https://puppetdb1002.eqiad.wmnet/pdb/query/v4/nodes/mw1253.eqiad.wmnet from a cumin host [11:27:25] and I foudn that log lines from you, maybe that re-created some form of knowledge of the host in puppet? 
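The curl query above returns a JSON record for the node, and "still has a partial" can be made concrete by looking at which timestamps that record carries. The sketch below only interprets an already-fetched record, so it needs no network access. The field names (`certname`, `deactivated`, `expired`, `catalog_timestamp`, `facts_timestamp`, `report_timestamp`) are those of the PuppetDB v4 nodes endpoint; the four labels and the sample record are assumptions made for illustration, not output captured from puppetdb1002.

```python
import json


def classify_node(record):
    """Classify a parsed record from /pdb/query/v4/nodes/<certname>.

    The labels are this sketch's own wording, not PuppetDB terminology.
    """
    # An unknown node yields an error object without a certname.
    if not record or "certname" not in record:
        return "absent"
    if record.get("deactivated") or record.get("expired"):
        return "deactivated"
    timestamps = [
        record.get(key)
        for key in ("catalog_timestamp", "facts_timestamp", "report_timestamp")
    ]
    # A healthy node has submitted a catalog, facts and a report.
    return "complete" if all(timestamps) else "partial"


# Hypothetical record: only a catalog was stored, no facts and no report.
sample = json.loads("""
{
  "certname": "mw1253.eqiad.wmnet",
  "deactivated": null,
  "expired": null,
  "catalog_timestamp": "2020-04-15T09:47:40.000Z",
  "facts_timestamp": null,
  "report_timestamp": null
}
""")
print(classify_node(sample))
```

A record like the sample would be consistent with a catalog having been compiled for the host without the host itself ever reporting in, which is the hypothesis raised in the conversation.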
[11:27:50] ahh i ran that command yesterday just trying to test lookup, picked something from my history. however as i passed the compile flag it probab ly did send some tuff to the puppetdb [11:28:01] it manifested as "cumin-alias: DC aliases do not cover all hosts: mw1253.eqiad.wmnet" email [11:28:33] that explains why it did not show up earlier than today [11:28:48] 10Operations, 10DBA, 10Goal: Set up backup strategy for es clusters - https://phabricator.wikimedia.org/T79922 (10jcrespo) > Does that mean we won't be storing metadata/content backup within the same DC and only on the remote DC There will be backups available within the same dc, the freshest ones, on dbpro... [11:28:59] * volans lunch [11:29:41] mutante: is it causing problems, if not it will naturaly go away or we can purge it [11:30:03] jbond42: not more than having to ignore an email per day [11:30:29] ack lets iognore it if its not cleaned by monday ill purge it [11:30:45] sounds good to me. thanks! [11:30:51] np cheers [11:32:01] (03PS1) 10Hnowlan: changeprop: Use global puppet CA cert [deployment-charts] - 10https://gerrit.wikimedia.org/r/589570 (https://phabricator.wikimedia.org/T249633) [11:33:44] !log imported helm-diff 2.11.0+3-2+deb10u1 to main for buster-wikimedia [11:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:38] !log contint2001 - apt-get update, run puppet to install helm-diff [11:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:18] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Rebuild helm/helm-diff for buster-wikimedia - https://phabricator.wikimedia.org/T249812 (10JMeybohm) 05Open→03Resolved imp... 
[11:37:22] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10JMeybohm) [11:38:18] 10Operations, 10Traffic, 10Goal, 10Performance-Team (Radar): Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (10Ladsgroup) That's weird, TLSv1.3 is famous for being faster than v1.2. [11:38:43] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10Dzahn) >>! In T224591#6042322, @Dzahn wrote: >... [11:39:57] 10Operations, 10DBA, 10Goal: Set up backup strategy for es clusters - https://phabricator.wikimedia.org/T79922 (10jcrespo) The main question is not really if we should do this or not- we should have bacula redundancy among dcs, the question is if we should cross-dc send the bacula copies or be 100% independe... 
[11:48:47] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Release-Engineering-Team (CI & Testing services): Rebuild helmfile for buster-wikimedia - https://phabricator.wikimedia.org/T250479 (10Dzahn) [11:50:42] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Release-Engineering-Team (CI & Testing services): Rebuild helmfile for buster-wikimedia - https://phabricator.wikimedia.org/T250479 (10Dzahn) a:05Dzahn→03None [11:56:17] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10Dzahn) [12:03:57] jbond42: what makes you think mw1253 will be automatically removed by monday? our auto-delete from puppetdb is 15 days IIRC [12:05:07] 10Operations, 10Research: recommendation api's test on scb nodes are flapping - https://phabricator.wikimedia.org/T247732 (10bmansurov) >>! In T247732#6064844, @elukey wrote: > @bmansurov thanks for following up! What I'd start doing is to log 50x errors from the MW api in the service logs if possible, so peop... 
[12:07:49] volans: for some reason i thought we set node-purge-ttl to a much lower value but i see its unset so yes it would be 14 days [12:09:11] also the cookbook does puppet node clean and deactivate [12:09:21] and clearly that didn't do anything as daniel re-run it today [12:10:03] oh i didn;t see it had allready been deactivated [12:10:37] ill take a look [12:10:59] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/22015/miscweb1002.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/587985 (https://phabricator.wikimedia.org/T247650) (owner: 10Dzahn) [12:16:52] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-wmde-users for awight - https://phabricator.wikimedia.org/T250364 (10Tobi_WMDE_SW) @awight is in my team and I approve of his request. [12:20:57] 10Operations, 10DBA, 10Goal: Set up backup strategy for es clusters - https://phabricator.wikimedia.org/T79922 (10Marostegui) Thanks for the clarification, from the diagram it wasn't entirely clear to me if the short-term backups would be kept locally (hence my question :-) ) or not. I also like more the i... [12:26:05] !og copied kubernetes-client from stretch-wikimedia to buster-wikimedia T224591 [12:26:06] T224591: Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 [12:26:20] moritzm: missing 'l' in log ;) [12:27:12] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10MoritzMuehlenhoff) kubernetes-client just ship... 
[12:27:42] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [12:28:06] !log copied kubernetes-client from stretch-wikimedia to buster-wikimedia T224591 [12:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:17] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-wmde-users for Christoph Jauera - https://phabricator.wikimedia.org/T250362 (10Tobi_WMDE_SW) Approving @WMDE-Fisch 's request! [12:28:19] volans: ood atch :-) [12:28:23] lol [12:34:48] 10Operations, 10Research: recommendation api's test on scb nodes are flapping - https://phabricator.wikimedia.org/T247732 (10elukey) Yes sorry what I meant is if there is an explanation of the 404 in the logs, since it is not something that caught any eyes on when debugging why a service flaps for example. Any... [12:38:21] (03PS1) 10Filippo Giunchedi: graphite: django 2.2 compat [puppet] - 10https://gerrit.wikimedia.org/r/589576 (https://phabricator.wikimedia.org/T247963) [12:42:07] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10Dzahn) [12:45:42] RECOVERY - zuul_merger_service_running on contint2001 is OK: PROCS OK: 1 process with regex args bin/zuul-merger https://www.mediawiki.org/wiki/Continuous_integration/Zuul [12:45:46] !log cntint2001 - restart nagios-nrpe-server [12:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:13] (03CR) 10Muehlenhoff: "That patch looks fine, but there's no real indication Django from backports gets pulled in, is it? 
graphite should just use the standard D" [puppet] - 10https://gerrit.wikimedia.org/r/589576 (https://phabricator.wikimedia.org/T247963) (owner: 10Filippo Giunchedi) [12:52:50] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/589576 (https://phabricator.wikimedia.org/T247963) (owner: 10Filippo Giunchedi) [12:54:32] !log contint2001 /usr/local/sbin/build-envoy-config -c /etc/envoy ; restart envoyproxy; was not listening on admin port [12:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:50] ACKNOWLEDGEMENT - Check no envoy runtime configuration is left persistent on idp-test2001 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused daniel_zahn WIP https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [12:55:50] ACKNOWLEDGEMENT - Check that envoy is running on idp-test2001 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is inactive daniel_zahn WIP https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [13:00:17] !log netbox1001 - sudo systemctl start netbox_ganeti_eqiad_sync (was failed) [13:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:03] mutante: that's a unit triggered by a systemd time, it's a one shot, it's never "running" [13:01:32] volans: but that doesn't mean it should be "failed" and alterting about systemd state [13:02:41] also "Check the last execution of netbox_ganeti_eqiad_sync" is CRIT as well [13:02:54] (03CR) 10Ppchelko: [C: 04-1] changeprop: Use staging eventgate in staging environment. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/589562 (https://phabricator.wikimedia.org/T249739) (owner: 10Hnowlan) [13:03:03] I agree but will be re-run in few minutes, so if it's failing should be investigated and if it was a one time failure know why so that the script can be made more resiliant (cc chaomodus ) [13:03:28] that last alert was CRIT for 21 hours though [13:04:53] :/ [13:06:16] (03CR) 10Ppchelko: changeprop: Use global puppet CA cert (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/589570 (https://phabricator.wikimedia.org/T249633) (owner: 10Hnowlan) [13:09:36] 10Operations, 10ops-eqiad: scb1001: Memory correctable errors -EDAC- - https://phabricator.wikimedia.org/T250482 (10ayounsi) p:05Triage→03Low [13:10:17] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on scb1001 is CRITICAL: 10 ge 4 Ayounsi https://phabricator.wikimedia.org/T250482 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=scb1001&var-datasource=eqiad+prometheus/ops [13:10:26] 10Operations: Puppet certificate discrepancies - https://phabricator.wikimedia.org/T250483 (10Volans) p:05Triage→03Medium [13:11:41] (03PS1) 10Muehlenhoff: Add acmechief config for idp-test [puppet] - 10https://gerrit.wikimedia.org/r/589582 (https://phabricator.wikimedia.org/T233930) [13:13:24] volans: maybe related to cert renewals on ganeti ..looking and/or making ticket [13:13:58] trying to start a failed thing is part of investigation though [13:16:16] (03PS1) 10Muehlenhoff: Add CNAME for idp-test.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/589583 (https://phabricator.wikimedia.org/T233930) [13:18:52] 10Operations, 10Traffic, 10Goal, 10Performance-Team (Radar): Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (10Gilles) In theory with features like 0-RTT resumption of course, but that doesn't mean that implementation, 
configuration and the real world follow suite with the theory. I don't know... [13:19:37] 10Operations, 10DBA, 10Goal: Set up backup strategy for es clusters - https://phabricator.wikimedia.org/T79922 (10jcrespo) > the diagram it wasn't entirely clear The summary is: "generate and keep short term locally, store long term on a geographically separate site." [13:22:16] (03CR) 10Ottomata: [C: 03+1] "I guess? These logs from from mwlog* hosts which are accessible by other groups outside of analytics, e.g. deployers and mw-log-readers." [puppet] - 10https://gerrit.wikimedia.org/r/589542 (https://phabricator.wikimedia.org/T249754) (owner: 10Elukey) [13:22:54] 10Operations: Puppet certificate discrepancies - https://phabricator.wikimedia.org/T250483 (10jbond) As far as i can tell the signed certs are not in the CRL either ` $ for host in $( (03CR) 10Andrew Bogott: "I'm pretty sure I want replicated pools. The total amount of data we're storing will be quite small but I want to maximize redundancy." [puppet] - 10https://gerrit.wikimedia.org/r/588752 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [13:33:19] 10Operations: Onboarding Stephen Shirley - https://phabricator.wikimedia.org/T250134 (10Marostegui) I have verified over videocall @Kormat's gpg KEYID and signed it. [13:35:12] (03CR) 10Ottomata: Refactor logstash::input::kafka to DRY ssl_truststore_location logic (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/589400 (owner: 10Ottomata) [13:36:02] (03PS7) 10Ottomata: Refactor logstash::input::kafka to DRY ssl_truststore_location logic [puppet] - 10https://gerrit.wikimedia.org/r/589400 [13:36:09] (03CR) 10Elukey: "> I'm pretty sure I want replicated pools. 
The total amount of data" [puppet] - 10https://gerrit.wikimedia.org/r/588752 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [13:37:06] (03CR) 10Elukey: [C: 03+2] statistics::rsync::mediawiki: reduce retention and improve security [puppet] - 10https://gerrit.wikimedia.org/r/589542 (https://phabricator.wikimedia.org/T249754) (owner: 10Elukey) [13:39:32] 10Operations, 10DBA, 10Goal: Set up backup strategy for es clusters - https://phabricator.wikimedia.org/T79922 (10Marostegui) Yep, that is good - thanks for the clarification :-) [13:40:27] 10Operations: Puppet certificate discrepancies - https://phabricator.wikimedia.org/T250483 (10jbond) I double checked db1105.eqiad.wmnet and i see that even though the certificate is not in the `/var/lib/puppet/server/ssl/ca/signed` it dose have the correct entry in `/var/lib/puppet/server/ssl/ca/inventory.txt`... [13:42:31] (03CR) 10Jbond: [C: 03+1] Add acmechief config for idp-test [puppet] - 10https://gerrit.wikimedia.org/r/589582 (https://phabricator.wikimedia.org/T233930) (owner: 10Muehlenhoff) [13:43:03] (03CR) 10Jbond: [C: 03+1] Add CNAME for idp-test.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/589583 (https://phabricator.wikimedia.org/T233930) (owner: 10Muehlenhoff) [13:43:06] (03PS15) 10Andrew Bogott: Designate: replace standalone memcached with a mcrouter cluster [puppet] - 10https://gerrit.wikimedia.org/r/588752 (https://phabricator.wikimedia.org/T249941) [13:43:31] (03CR) 10Mforns: "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/589320 (https://phabricator.wikimedia.org/T230743) (owner: 10Bearloga) [13:43:51] 10Operations, 10Traffic, 10Goal, 10Performance-Team (Radar): Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (10Vgutierrez) we haven't deployed 0-RTT at this time but even without it, a full TLSv1.3 handshake requires 1 RTT less than a full TLSv1.2 handshake. Thanks for the detailed report @Gil... 
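The "1 RTT less" point above can be put into rough numbers. The sketch below counts only network round trips before the first request can be sent on a fresh TCP connection; the 100 ms RTT is an arbitrary example value, and the model deliberately ignores server processing time, packet loss, TCP Fast Open and TLS False Start, so it says nothing about the measurements in T170567.

```python
# Round trips spent on the TLS handshake before application data can flow.
HANDSHAKE_RTTS = {
    "TLSv1.2 full": 2,
    "TLSv1.3 full": 1,
    "TLSv1.3 0-RTT resumption": 0,
}


def time_to_first_request_ms(rtt_ms, handshake, tcp_rtts=1):
    """Idealised delay before the first HTTP request leaves the client."""
    return rtt_ms * (tcp_rtts + HANDSHAKE_RTTS[handshake])


for name in HANDSHAKE_RTTS:
    print(name, time_to_first_request_ms(100, name), "ms")
```

Under these assumptions the saving is exactly one RTT per new connection, so it matters most for distant clients and least for connections that are reused.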
[13:44:40] 10Operations, 10Traffic, 10Goal, 10Performance-Team (Radar): Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (10Ladsgroup) >>! In T170567#6065576, @Gilles wrote: > In theory with features like 0-RTT resumption of course, but that doesn't mean that implementation, configuration and the real worl... [13:54:14] !log Running VACUUM FULL for gis DB in maps2004.codfw.wmnet (which is depooled at the moment) [13:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:08] (03PS1) 10Jbond: ferm-status: handle port ranges [puppet] - 10https://gerrit.wikimedia.org/r/589596 [14:00:25] !log ganeti1003 - fixing gnt-rapi daemon not running [14:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:38] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:59] (03PS1) 10Ottomata: Collect eventgate error.validation topics into logstash [puppet] - 10https://gerrit.wikimedia.org/r/589597 (https://phabricator.wikimedia.org/T116719) [14:01:27] !log netbox1001 - netbox_ganeti_eqiad_synx / systemd state fixed after gnt-rapi is runnign again on ganeti1003 [14:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:06] (03CR) 10Jbond: [C: 03+2] ferm-status: handle port ranges [puppet] - 10https://gerrit.wikimedia.org/r/589596 (owner: 10Jbond) [14:06:16] (03CR) 10Ottomata: Collect eventgate error.validation topics into logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/589597 (https://phabricator.wikimedia.org/T116719) (owner: 10Ottomata) [14:07:00] RECOVERY - Check the last execution of netbox_ganeti_eqiad_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:10:17] (03PS2) 10Ottomata: Collect eventgate error.validation topics into 
logstash [puppet] - 10https://gerrit.wikimedia.org/r/589597 (https://phabricator.wikimedia.org/T116719) [14:11:01] volans: fixed ^. the issue was not on netbox side but on ganeti side [14:12:10] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1002/22016/" [puppet] - 10https://gerrit.wikimedia.org/r/589597 (https://phabricator.wikimedia.org/T116719) (owner: 10Ottomata) [14:14:06] (03CR) 10Ottomata: Collect eventgate error.validation topics into logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/589597 (https://phabricator.wikimedia.org/T116719) (owner: 10Ottomata) [14:14:16] (03PS1) 10Muehlenhoff: Fix installation of graphite-web on Buster [puppet] - 10https://gerrit.wikimedia.org/r/589599 (https://phabricator.wikimedia.org/T247963) [14:14:57] (03CR) 10jerkins-bot: [V: 04-1] Fix installation of graphite-web on Buster [puppet] - 10https://gerrit.wikimedia.org/r/589599 (https://phabricator.wikimedia.org/T247963) (owner: 10Muehlenhoff) [14:16:23] (03PS2) 10Muehlenhoff: Fix installation of graphite-web on Buster [puppet] - 10https://gerrit.wikimedia.org/r/589599 (https://phabricator.wikimedia.org/T247963) [14:16:55] (03PS1) 10Elukey: Deprecate statistics::rsync::mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/589600 [14:19:02] !log add peer AS29802 to cr2-eqdfw and cr2-esams [14:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:05] (03CR) 10Ottomata: [C: 03+1] "Cool! 
Let's add a README file in the existent /srv/log dir on stat1007 pointing folks to Hive event.mediawiki_api_request table" [puppet] - 10https://gerrit.wikimedia.org/r/589600 (owner: 10Elukey) [14:20:13] (03PS3) 10Muehlenhoff: Fix installation of graphite-web on Buster [puppet] - 10https://gerrit.wikimedia.org/r/589599 (https://phabricator.wikimedia.org/T247963) [14:21:15] (03CR) 10Elukey: [C: 03+1] Designate: replace standalone memcached with a mcrouter cluster [puppet] - 10https://gerrit.wikimedia.org/r/588752 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [14:21:38] mutante: ack, thanks! (cc chaomodus FYI) [14:23:47] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10Services, 10Service-deployment-requests: New Service Request: Wikimedia push notification service - https://phabricator.wikimedia.org/T250452 (10Mholloway) [14:25:16] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/587985 (https://phabricator.wikimedia.org/T247650) (owner: 10Dzahn) [14:26:59] (03CR) 10Muehlenhoff: "https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/589599/ should fix this" [puppet] - 10https://gerrit.wikimedia.org/r/589576 (https://phabricator.wikimedia.org/T247963) (owner: 10Filippo Giunchedi) [14:29:31] !log ganeti2001 - kileld and restarted gnt-rapi process with the correct new key and cert [14:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:11] (03CR) 10Muehlenhoff: "Looks good, two comments inline." 
(032 comments) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/587538 (https://phabricator.wikimedia.org/T233939) (owner: 10Jbond) [14:41:47] (03CR) 10Jbond: "thanks see inline" (032 comments) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/587538 (https://phabricator.wikimedia.org/T233939) (owner: 10Jbond) [14:41:50] (03PS1) 10Ottomata: eventgate - No-op use .Values.puppet_ca_cert for kafka_ca_cert file [deployment-charts] - 10https://gerrit.wikimedia.org/r/589605 (https://phabricator.wikimedia.org/T249633) [14:42:05] (03CR) 10jerkins-bot: [V: 04-1] eventgate - No-op use .Values.puppet_ca_cert for kafka_ca_cert file [deployment-charts] - 10https://gerrit.wikimedia.org/r/589605 (https://phabricator.wikimedia.org/T249633) (owner: 10Ottomata) [14:43:44] (03PS2) 10Ottomata: eventgate - No-op use .Values.puppet_ca_cert for kafka_ca_cert file [deployment-charts] - 10https://gerrit.wikimedia.org/r/589605 (https://phabricator.wikimedia.org/T249633) [14:43:58] (03CR) 10jerkins-bot: [V: 04-1] eventgate - No-op use .Values.puppet_ca_cert for kafka_ca_cert file [deployment-charts] - 10https://gerrit.wikimedia.org/r/589605 (https://phabricator.wikimedia.org/T249633) (owner: 10Ottomata) [14:48:07] (03PS3) 10Ottomata: eventgate - No-op use .Values.puppet_ca_cert for kafka_ca_cert file [deployment-charts] - 10https://gerrit.wikimedia.org/r/589605 (https://phabricator.wikimedia.org/T249633) [14:48:53] (03CR) 10Ottomata: [C: 03+2] eventgate - No-op use .Values.puppet_ca_cert for kafka_ca_cert file [deployment-charts] - 10https://gerrit.wikimedia.org/r/589605 (https://phabricator.wikimedia.org/T249633) (owner: 10Ottomata) [14:50:00] (03PS1) 10Dzahn: ganeti: add monitoring for gnt-rapi daemon process [puppet] - 10https://gerrit.wikimedia.org/r/589608 [14:53:27] (03CR) 10jerkins-bot: [V: 04-1] ganeti: add monitoring for gnt-rapi daemon process [puppet] - 10https://gerrit.wikimedia.org/r/589608 (owner: 10Dzahn) [14:53:57] 10Operations, 10ops-eqiad, 
10DC-Ops, 10serviceops: scb1001: Memory correctable errors -EDAC- - https://phabricator.wikimedia.org/T250482 (10Dzahn) [14:54:14] (03PS1) 10Ottomata: eventgate - use indent 4 when rendering kafka_ca_crt.pem [deployment-charts] - 10https://gerrit.wikimedia.org/r/589613 [14:54:52] (03CR) 10Ottomata: [C: 03+2] eventgate - use indent 4 when rendering kafka_ca_crt.pem [deployment-charts] - 10https://gerrit.wikimedia.org/r/589613 (owner: 10Ottomata) [14:59:34] (03PS1) 10Ottomata: eventgate - Fix puppet_ca_crt type [deployment-charts] - 10https://gerrit.wikimedia.org/r/589618 [14:59:58] (03CR) 10Ottomata: [C: 03+2] eventgate - Fix puppet_ca_crt type [deployment-charts] - 10https://gerrit.wikimedia.org/r/589618 (owner: 10Ottomata) [15:07:42] RECOVERY - Memory correctable errors -EDAC- on scb1001 is OK: (C)4 ge (W)2 ge 0 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=scb1001&var-datasource=eqiad+prometheus/ops [15:14:33] (03CR) 10Ottomata: "You'll also need to specify to use the private/general.yaml helmfile in your services' helmfile.yaml" [deployment-charts] - 10https://gerrit.wikimedia.org/r/589570 (https://phabricator.wikimedia.org/T249633) (owner: 10Hnowlan) [15:16:51] (03PS1) 10Ottomata: Use private/general.yaml in event* hemlfile.yaml files [deployment-charts] - 10https://gerrit.wikimedia.org/r/589621 (https://phabricator.wikimedia.org/T249633) [15:18:54] (03PS2) 10Ottomata: Use private/general.yaml in event* hemlfile.yaml files [deployment-charts] - 10https://gerrit.wikimedia.org/r/589621 (https://phabricator.wikimedia.org/T249633) [15:19:50] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [15:20:39] !log remove cronjobs from mwmaint1002 previously updated to 
systemd timers and erroneously left in crontab -- diffs: https://phabricator.wikimedia.org/P11012 T211250 [15:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:47] T211250: Create a mediawiki::cronjob define - https://phabricator.wikimedia.org/T211250 [15:21:36] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [15:35:27] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/589096 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [15:37:16] (03PS12) 10Andrew Bogott: glance image_sync: use primary_glance_image_store to choose the image store [puppet] - 10https://gerrit.wikimedia.org/r/589096 (https://phabricator.wikimedia.org/T249941) [15:40:17] (03CR) 10Filippo Giunchedi: [C: 03+1] Refactor logstash::input::kafka to DRY ssl_truststore_location logic (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/589400 (owner: 10Ottomata) [15:40:39] (03CR) 10Ottomata: [C: 03+2] Use private/general.yaml in event* hemlfile.yaml files [deployment-charts] - 10https://gerrit.wikimedia.org/r/589621 (https://phabricator.wikimedia.org/T249633) (owner: 10Ottomata) [15:41:38] (03CR) 10Andrew Bogott: [C: 03+2] Designate: replace standalone memcached with a mcrouter cluster [puppet] - 10https://gerrit.wikimedia.org/r/588752 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [15:41:59] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' . [15:41:59] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' . 
[15:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:08] (03CR) 10Filippo Giunchedi: [C: 03+1] "Thank you for tracking this down" [puppet] - 10https://gerrit.wikimedia.org/r/589599 (https://phabricator.wikimedia.org/T247963) (owner: 10Muehlenhoff) [15:44:23] (03Abandoned) 10Filippo Giunchedi: graphite: django 2.2 compat [puppet] - 10https://gerrit.wikimedia.org/r/589576 (https://phabricator.wikimedia.org/T247963) (owner: 10Filippo Giunchedi) [15:46:13] (03PS1) 10Ottomata: eventgate-{main,analytics} staging - use Kafka TLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/589627 (https://phabricator.wikimedia.org/T250149) [15:46:42] (03CR) 10Ottomata: [C: 03+2] eventgate-{main,analytics} staging - use Kafka TLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/589627 (https://phabricator.wikimedia.org/T250149) (owner: 10Ottomata) [15:48:03] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' . [15:48:04] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' . 
[15:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:40] (03CR) 10Elukey: "Need some info about what to remove on mwlog1001 about rsync from SRE :)" [puppet] - 10https://gerrit.wikimedia.org/r/589600 (owner: 10Elukey) [15:50:33] (03CR) 10Filippo Giunchedi: Collect eventgate error.validation topics into logstash (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/589597 (https://phabricator.wikimedia.org/T116719) (owner: 10Ottomata) [15:51:40] (03PS1) 10Andrew Bogott: designate mcrouter: add a default_policy to the route [puppet] - 10https://gerrit.wikimedia.org/r/589630 (https://phabricator.wikimedia.org/T249941) [15:52:02] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-main' for release 'production' . [15:52:02] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-main' for release 'canary' . [15:52:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:05] 10Operations, 10RESTBase, 10RESTBase-Cassandra: restbase2014: systemd critical - cassandra-c.service loaded failed - https://phabricator.wikimedia.org/T250498 (10ayounsi) p:05Triage→03High [15:55:22] ACKNOWLEDGEMENT - Check systemd state on restbase2014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
Ayounsi https://phabricator.wikimedia.org/T250498 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:44] (03CR) 10Andrew Bogott: [C: 03+2] designate mcrouter: add a default_policy to the route [puppet] - 10https://gerrit.wikimedia.org/r/589630 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [15:58:24] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1004-cloudelastic-chi-eqiad on cloudelastic1004 is OK: (C)100 gt (W)80 gt 30 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1004&panelId=37 [16:00:22] PROBLEM - Check the last execution of mediawiki_job_parser_cache_purging on mwmaint1002 is CRITICAL: CRITICAL: Status of the systemd unit mediawiki_job_parser_cache_purging https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:00:34] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:02:39] that's me, looking [16:03:11] (03PS1) 10Ottomata: eventstreams - No-op. use global puppet ca cert [deployment-charts] - 10https://gerrit.wikimedia.org/r/589635 (https://phabricator.wikimedia.org/T249633) [16:04:16] (03CR) 10Ottomata: [C: 03+2] eventstreams - No-op. 
use global puppet ca cert [deployment-charts] - 10https://gerrit.wikimedia.org/r/589635 (https://phabricator.wikimedia.org/T249633) (owner: 10Ottomata) [16:04:47] those mwmaint1002 alerts are T250231 again [16:04:48] T250231: purgeParserCache.php: Cannot purge this kind of parser cache - https://phabricator.wikimedia.org/T250231 [16:05:44] Krinkle: ^ fyi, in case you'd like a fresh one to look at :) [16:05:49] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [16:05:49] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . [16:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:38] (03PS1) 10MSantos: Disable OSM replication while vacuum full [puppet] - 10https://gerrit.wikimedia.org/r/589636 [16:06:40] (03CR) 10Herron: [C: 03+1] "I like it! Thanks ottomata" [puppet] - 10https://gerrit.wikimedia.org/r/589400 (owner: 10Ottomata) [16:06:46] 10Operations, 10RESTBase, 10RESTBase-Cassandra: restbase2014: systemd critical - cassandra-c.service loaded failed - https://phabricator.wikimedia.org/T250498 (10Eevans) [16:06:49] 10Operations, 10ops-codfw: Degraded RAID on restbase2014 - https://phabricator.wikimedia.org/T250050 (10Eevans) [16:07:32] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1004-cloudelastic-chi-eqiad on cloudelastic1004 is CRITICAL: 265.5 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1004&panelId=37 [16:07:41] (03CR) 10Gehel: [C: 03+2] Disable OSM replication while vacuum full [puppet] - 10https://gerrit.wikimedia.org/r/589636 (owner: 10MSantos) [16:08:03] 10Operations, 10RESTBase, 10RESTBase-Cassandra: 
restbase2014: systemd critical - cassandra-c.service loaded failed - https://phabricator.wikimedia.org/T250498 (10Eevans) I think this popped up when Puppet was re-enabled (which ironically is failing anyway because of the failed unit). It's been put under mai... [16:11:01] (03CR) 10Ottomata: "Ok, let's merge on Monday! :)" [puppet] - 10https://gerrit.wikimedia.org/r/589400 (owner: 10Ottomata) [16:15:20] (03CR) 10Vgutierrez: [C: 03+1] Add acmechief config for idp-test [puppet] - 10https://gerrit.wikimedia.org/r/589582 (https://phabricator.wikimedia.org/T233930) (owner: 10Muehlenhoff) [16:15:40] !log Revert recent email change of User:CPHL@SUL's email [16:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:38] (03PS2) 10Hnowlan: changeprop: Use staging eventgate in staging environment. [deployment-charts] - 10https://gerrit.wikimedia.org/r/589562 (https://phabricator.wikimedia.org/T249739) [16:22:54] (03PS2) 10Hnowlan: changeprop: Use global puppet CA cert [deployment-charts] - 10https://gerrit.wikimedia.org/r/589570 (https://phabricator.wikimedia.org/T249633) [16:23:01] (03CR) 10jerkins-bot: [V: 04-1] changeprop: Use global puppet CA cert [deployment-charts] - 10https://gerrit.wikimedia.org/r/589570 (https://phabricator.wikimedia.org/T249633) (owner: 10Hnowlan) [16:24:52] (03PS3) 10Hnowlan: changeprop: Use global puppet CA cert [deployment-charts] - 10https://gerrit.wikimedia.org/r/589570 (https://phabricator.wikimedia.org/T249633) [16:39:28] (03PS1) 10DLynch: DiscussionTools EditAttemptStepSamplingRate increase for some wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/589641 (https://phabricator.wikimedia.org/T250086) [16:47:28] 10Operations, 10MediaWiki-Cache, 10Performance-Team, 10serviceops, and 2 others: WANObjectCache::getWithSetCallback seems not to set objects when fetching data is slow - https://phabricator.wikimedia.org/T244877 (10Krinkle) 05Open→03Resolved a:03aaron [16:48:16] (03PS1) 10Elukey: WIP - 
profile::openstack::base::designate::service: fix mcrouter config [puppet] - 10https://gerrit.wikimedia.org/r/589643 [16:50:26] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: scb1001: Memory correctable errors -EDAC- - https://phabricator.wikimedia.org/T250482 (10wiki_willy) a:03Cmjohnson @Dzahn - when I look at the purchase date in Netbox, it shows this server was first installed 7yrs ago in January 2013. If that's accurate... [17:20:58] (03PS1) 10Andrew Bogott: Designate/mcrouter: another attempt at mcrouter config [puppet] - 10https://gerrit.wikimedia.org/r/589652 (https://phabricator.wikimedia.org/T249941) [17:22:18] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [17:22:19] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:12] (03CR) 10jerkins-bot: [V: 04-1] Designate/mcrouter: another attempt at mcrouter config [puppet] - 10https://gerrit.wikimedia.org/r/589652 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [17:25:14] 10Operations, 10Analytics, 10Traffic: Create replacement for Varnishkafka - https://phabricator.wikimedia.org/T237993 (10elukey) There is currently too much data that flows to kafka, for cp3050 we have 36GB * 12 partitions for a single day, definitely too much. I took a look to kafka messages and using `kaf... 
[17:26:53] (03PS2) 10Andrew Bogott: Designate/mcrouter: another attempt at mcrouter config [puppet] - 10https://gerrit.wikimedia.org/r/589652 (https://phabricator.wikimedia.org/T249941) [17:30:16] (03CR) 10Andrew Bogott: [C: 03+2] Designate/mcrouter: another attempt at mcrouter config [puppet] - 10https://gerrit.wikimedia.org/r/589652 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [17:33:07] (03PS1) 10Andrew Bogott: designate: further (largely random) mcrouter config tweak [puppet] - 10https://gerrit.wikimedia.org/r/589657 (https://phabricator.wikimedia.org/T249941) [17:34:43] !log moving msw1 to msw-c racks mounted switch cable ports from port 49 to port 50 [17:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:19] (03CR) 10jerkins-bot: [V: 04-1] designate: further (largely random) mcrouter config tweak [puppet] - 10https://gerrit.wikimedia.org/r/589657 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [17:38:26] (03PS2) 10Andrew Bogott: designate: further (largely random) mcrouter config tweak [puppet] - 10https://gerrit.wikimedia.org/r/589657 (https://phabricator.wikimedia.org/T249941) [17:41:48] (03CR) 10jerkins-bot: [V: 04-1] designate: further (largely random) mcrouter config tweak [puppet] - 10https://gerrit.wikimedia.org/r/589657 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [17:41:58] !log replacing network cable pc1009 T250257 [17:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:03] T250257: Interface errors on asw2-c-eqiad - ge-3/0/9 (pc1009) - https://phabricator.wikimedia.org/T250257 [17:43:54] 10Operations, 10ops-eqiad: Audit msw1-eqiad cables - https://phabricator.wikimedia.org/T245188 (10ayounsi) The cables to those switches are also missing their cable IDs `lines=10 msw-a1-eqiad msw-a2-eqiad msw-a3-eqiad msw-a4-eqiad msw-a5-eqiad msw-a7-eqiad msw-a6-eqiad msw-b8-eqiad msw-b7-eqiad msw-b5-eqiad m... 
[17:44:53] (03PS3) 10Andrew Bogott: designate: further (largely random) mcrouter config tweak [puppet] - 10https://gerrit.wikimedia.org/r/589657 (https://phabricator.wikimedia.org/T249941) [17:46:09] 10Operations, 10ops-eqiad: Audit msw1-eqiad cables - https://phabricator.wikimedia.org/T245188 (10Cmjohnson) Verified ports for each switch, all but 2 are in port 50 on the mgmt switches A1 is port 1 C6 is port 47 [17:47:57] (03CR) 10jerkins-bot: [V: 04-1] designate: further (largely random) mcrouter config tweak [puppet] - 10https://gerrit.wikimedia.org/r/589657 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [17:49:42] ACKNOWLEDGEMENT - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. RLazarus T250231 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:49:42] ACKNOWLEDGEMENT - Check the last execution of mediawiki_job_parser_cache_purging on mwmaint1002 is CRITICAL: CRITICAL: Status of the systemd unit mediawiki_job_parser_cache_purging RLazarus T250231 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:51:00] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] designate: further (largely random) mcrouter config tweak [puppet] - 10https://gerrit.wikimedia.org/r/589657 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [17:55:36] 10Operations, 10ops-eqiad, 10DC-Ops: Interface errors on asw2-c-eqiad - ge-3/0/9 (pc1009) - https://phabricator.wikimedia.org/T250257 (10Cmjohnson) Cable has been swapped, cleared the statistics and will monitor through the weekend for any more framing errors [17:58:58] PROBLEM - Disk space on contint2001 is CRITICAL: DISK CRITICAL - /var/lib/docker/overlay2/8103ecaaf16080d5be47dde81a7947bb22858f1ebf9d259c0745066d8eb7b0e1/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=contint2001&var-datasource=codfw+prometheus/ops [17:59:00] (03PS1) 10Andrew Bogott: Designate mcrouter: talk to memcached on port 11000 [puppet] - 10https://gerrit.wikimedia.org/r/589664 (https://phabricator.wikimedia.org/T249941) [18:00:11] 10Operations, 10netbox: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10ayounsi) [18:02:04] (03CR) 10jerkins-bot: [V: 04-1] Designate mcrouter: talk to memcached on port 11000 [puppet] - 10https://gerrit.wikimedia.org/r/589664 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [18:02:56] 10Operations, 10ops-eqiad: Audit msw1-eqiad cables - https://phabricator.wikimedia.org/T245188 (10ayounsi) I imported all msw1-eqiad cables into Netbox as I need them for automation. Audit still needs to happen. [18:03:46] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Designate mcrouter: talk to memcached on port 11000 [puppet] - 10https://gerrit.wikimedia.org/r/589664 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [18:04:06] 10Operations, 10homer, 10netops, 10Patch-For-Review: Homer: Netbox driven switch interfaces - https://phabricator.wikimedia.org/T250429 (10ayounsi) [18:13:49] (03PS1) 10Cmjohnson: Adding mgmt dns for restbase1028-1030 [dns] - 10https://gerrit.wikimedia.org/r/589667 (https://phabricator.wikimedia.org/T241784) [18:18:29] (03CR) 10VolkerE: apereo_cas: update templates login page (032 comments) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/587538 (https://phabricator.wikimedia.org/T233939) (owner: 10Jbond) [18:56:12] (03PS1) 10RLazarus: maintenance: Migrate db_lag_stats_reporter to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589672 (https://phabricator.wikimedia.org/T211250) [19:01:26] RECOVERY - Disk space on contint2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=contint2001&var-datasource=codfw+prometheus/ops [19:07:22] (03PS1) 10Krinkle: Enable LCStoreStaticArray on depooled mw1407 for benchmarking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/589674 (https://phabricator.wikimedia.org/T99740) [19:08:06] (03CR) 10Krinkle: Enable LCStoreStaticArray on depooled mw1407 for benchmarking (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/589674 (https://phabricator.wikimedia.org/T99740) (owner: 10Krinkle) [19:09:01] (03PS1) 10Andrew Bogott: designate/mcrouter: all each mcrouter to talk to each memcached [puppet] - 10https://gerrit.wikimedia.org/r/589676 (https://phabricator.wikimedia.org/T249941) [19:12:16] (03CR) 10jerkins-bot: [V: 04-1] designate/mcrouter: all each mcrouter to talk to each memcached [puppet] - 10https://gerrit.wikimedia.org/r/589676 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [19:13:44] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] designate/mcrouter: all each mcrouter to talk to each memcached [puppet] - 10https://gerrit.wikimedia.org/r/589676 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [19:16:47] (03PS1) 10Andrew Bogott: Designate/mcrouter: typo fix [puppet] - 10https://gerrit.wikimedia.org/r/589678 (https://phabricator.wikimedia.org/T249941) [19:17:02] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Designate/mcrouter: typo fix [puppet] - 10https://gerrit.wikimedia.org/r/589678 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [19:26:32] (03PS1) 10RLazarus: maintenance: Migrate cirrussearch to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589680 (https://phabricator.wikimedia.org/T211250) [19:28:31] 10Operations, 10MediaWiki-General, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), and 5 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 
(10holger.knust) @DannyS712 Look at https://commons.wikimedi... [19:29:59] (03CR) 10jerkins-bot: [V: 04-1] maintenance: Migrate cirrussearch to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589680 (https://phabricator.wikimedia.org/T211250) (owner: 10RLazarus) [19:30:52] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f4b4afba470: Failed to establish a new connection: [Errno 111] Connec [19:30:53] ttps://wikitech.wikimedia.org/wiki/Search%23Administration [19:32:10] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 381 threshold =0.15 breach: active_shards: 1186, active_primary_shards: 776, task_max_waiting_in_queue_millis: 0, unassigned_shards: 369, number_of_data_nodes: 3, number_of_nodes: 3, status: yellow, delayed_unassigned_shards: 0, initializing_shards: 12, relocating_shards: 0, active_shards_percent_as_nu [19:32:10] 015955, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, cluster_name: cloudelastic-chi-eqiad https://wikitech.wikimedia.org/wiki/Search%23Administration [19:32:39] (03PS2) 10RLazarus: maintenance: Migrate cirrussearch to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589680 (https://phabricator.wikimedia.org/T211250) [19:32:41] !log Depool mw1407.eqiad.wmnet for opcache and LCStoreStaticArray testing. 
– T99740 [19:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:47] T99740: Use static php array files for l10n cache at WMF (instead of CDB) - https://phabricator.wikimedia.org/T99740 [19:33:10] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 374 threshold =0.15 breach: number_of_nodes: 3, relocating_shards: 0, number_of_data_nodes: 3, active_primary_shards: 776, status: yellow, delayed_unassigned_shards: 0, unassigned_shards: 362, initializing_shards: 12, timed_out: False, cluster_name: cloudelastic-chi-eqiad, active_shards_percent_as_numb [19:33:10] 797, active_shards: 1193, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 1537, number_of_pending_tasks: 2 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:33:28] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 370 threshold =0.15 breach: number_of_data_nodes: 3, task_max_waiting_in_queue_millis: 757, number_of_nodes: 3, status: yellow, number_of_in_flight_fetch: 0, relocating_shards: 0, unassigned_shards: 358, active_shards: 1197, timed_out: False, active_primary_shards: 776, cluster_name: cloudelastic-chi-e [19:33:28] ds_percent_as_number: 76.38800255264837, number_of_pending_tasks: 2, delayed_unassigned_shards: 0, initializing_shards: 12 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:33:43] !log Depool mw1407.eqiad.wmnet for opcache testing. Do not repool without first reverting https://gerrit.wikimedia.org/r/589674. 
[19:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:53] (03CR) 10Krinkle: [C: 03+2] Enable LCStoreStaticArray on depooled mw1407 for benchmarking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/589674 (https://phabricator.wikimedia.org/T99740) (owner: 10Krinkle) [19:35:48] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1002 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: delayed_unassigned_shards: 0, number_of_nodes: 4, active_shards: 1449, timed_out: False, task_max_waiting_in_queue_millis: 29, initializing_shards: 5, number_of_data_nodes: 4, cluster_name: cloudelastic-chi-eqiad, relocating_shards: 0, active_shards_percent_as_number: 92.46968730057435, number_of_p [19:35:48] active_primary_shards: 776, status: yellow, number_of_in_flight_fetch: 0, unassigned_shards: 113 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:36:20] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1004 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: unassigned_shards: 46, active_shards: 1516, number_of_nodes: 4, status: yellow, number_of_data_nodes: 4, timed_out: False, relocating_shards: 0, active_shards_percent_as_number: 96.7453733248245, active_primary_shards: 776, number_of_in_flight_fetch: 0, delayed_unassigned_shards: 0, cluster_name: c [19:36:20] qiad, task_max_waiting_in_queue_millis: 0, number_of_pending_tasks: 0, initializing_shards: 5 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:36:42] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1003 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: timed_out: False, initializing_shards: 5, number_of_pending_tasks: 2, active_shards: 1527, task_max_waiting_in_queue_millis: 141, unassigned_shards: 35, number_of_data_nodes: 4, delayed_unassigned_shards: 0, relocating_shards: 0, status: yellow, cluster_name: cloudelastic-chi-eqiad, active_shards_p [19:36:43] 
97.44735162731334, active_primary_shards: 776, number_of_nodes: 4, number_of_in_flight_fetch: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:37:06] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1001 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: active_shards_percent_as_number: 98.2769623484365, cluster_name: cloudelastic-chi-eqiad, number_of_data_nodes: 4, number_of_nodes: 4, unassigned_shards: 22, delayed_unassigned_shards: 0, active_shards: 1540, status: yellow, timed_out: False, initializing_shards: 5, task_max_waiting_in_queue_millis: [19:37:06] flight_fetch: 0, active_primary_shards: 776, relocating_shards: 0, number_of_pending_tasks: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:41:24] Krinkle: Wow, have we really been running LCStoreStaticArray on beta for 11 months?! [19:41:31] * James_F has seriously lost track of time. [19:41:46] James_F: I don't think so really. I think you drafted that commit 11 months ago [19:41:50] and I failed to update the date [19:41:58] Aaah. [19:41:59] but maybe.. [19:42:02] somewhere in the middle [19:42:04] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1004-cloudelastic-chi-eqiad on cloudelastic1004 is OK: (C)100 gt (W)80 gt 14.84 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1004&panelId=37 [19:42:08] Oy. [19:42:39] https://gerrit.wikimedia.org/r/508724 was merged on 20 Feb. [19:43:05] That feels more likely. [19:43:15] But yes, I wrote the patch a year ago. [19:43:23] * James_F stops doubting his sanity as much. [19:47:00] PROBLEM - PHP opcache health on mw1407 is CRITICAL: CRITICAL: opcache full. 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:47:12] I guess that's you, Krinkle [19:47:31] Right, indeed. [19:48:42] James_F: It must of been a long week if we’ve started doubting our sanity [19:50:19] (03PS1) 10RLazarus: maintenance: Migrate generatecaptcha to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589688 (https://phabricator.wikimedia.org/T211250) [19:52:40] RhinosF1: Started? I'm 18 years into Wikimedia activity and it [19:52:44] 's all a blur. ;-) [19:58:03] James_F: True, 18 years is a long time to lose sanity. I’ve only be around 18 months and long gone mad. [19:58:19] * James_F grins. [19:58:57] * RhinosF1 often grins during irc conversations [19:59:44] Us technical people have my kind of humour [20:02:26] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1004-cloudelastic-chi-eqiad on cloudelastic1004 is CRITICAL: 100.9 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1004&panelId=37 [20:11:18] (03PS1) 10Andrew Bogott: designate/mcrouter: use ipv4 address for servers rather than fqdn [puppet] - 10https://gerrit.wikimedia.org/r/589694 (https://phabricator.wikimedia.org/T249941) [20:12:43] (03PS1) 10RLazarus: maintenance: Migrate pageassessments to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589695 (https://phabricator.wikimedia.org/T211250) [20:15:00] (03CR) 10Andrew Bogott: [C: 03+2] designate/mcrouter: use ipv4 address for servers rather than fqdn [puppet] - 10https://gerrit.wikimedia.org/r/589694 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [20:15:36] PROBLEM - Disk space on contint2001 is CRITICAL: DISK CRITICAL - /var/lib/docker/overlay2/6406013dc5739e758ee616521304ca2658cc961fcd34dafb9890d46b6069c224/merged is not accessible: Permission denied 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=contint2001&var-datasource=codfw+prometheus/ops [20:29:32] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is OK: (C)100 gt (W)80 gt 69.48 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37 [20:32:10] (03PS1) 10CDanis: full deployment of nic_saturation_exporter [puppet] - 10https://gerrit.wikimedia.org/r/589703 (https://phabricator.wikimedia.org/T250401) [20:32:44] (03CR) 10CDanis: [C: 04-2] "to be deployed Monday?" [puppet] - 10https://gerrit.wikimedia.org/r/589703 (https://phabricator.wikimedia.org/T250401) (owner: 10CDanis) [20:36:26] RECOVERY - Disk space on contint2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=contint2001&var-datasource=codfw+prometheus/ops [20:50:12] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1004-cloudelastic-chi-eqiad on cloudelastic1004 is OK: (C)100 gt (W)80 gt 75.25 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1004&panelId=37 [20:56:54] (03PS1) 10RLazarus: maintenance: Migrate readinglists to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589706 (https://phabricator.wikimedia.org/T211250) [21:03:46] (03CR) 10CDanis: [C: 04-2] "to be deployed Monday? 
PCC looks good: https://puppet-compiler.wmflabs.org/compiler1002/22027/" [puppet] - 10https://gerrit.wikimedia.org/r/589703 (https://phabricator.wikimedia.org/T250401) (owner: 10CDanis) [21:10:33] (03CR) 10CDanis: [C: 04-2] "this PCC is slightly better; shows memcache hosts keeping the exporter https://puppet-compiler.wmflabs.org/compiler1003/22029/" [puppet] - 10https://gerrit.wikimedia.org/r/589703 (https://phabricator.wikimedia.org/T250401) (owner: 10CDanis) [21:19:05] (03PS1) 10Hashar: Point to current working directory by default [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/589710 [21:20:15] (03CR) 10jerkins-bot: [V: 04-1] Point to current working directory by default [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/589710 (owner: 10Hashar) [21:34:08] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1004-cloudelastic-chi-eqiad on cloudelastic1004 is CRITICAL: 103.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1004&panelId=37 [21:53:23] (03PS1) 10Jhedden: cloudvps: add prometheus alert rules for project instances [puppet] - 10https://gerrit.wikimedia.org/r/589716 (https://phabricator.wikimedia.org/T250206) [21:59:02] (03CR) 10Alex Monk: "hm. for tools 15% is fine but for other projects a single instance having an issue could be a critical thing. maybe that's fine?" 
[puppet] - 10https://gerrit.wikimedia.org/r/589716 (https://phabricator.wikimedia.org/T250206) (owner: 10Jhedden) [22:05:10] (03CR) 10Jhedden: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/589716 (https://phabricator.wikimedia.org/T250206) (owner: 10Jhedden) [22:18:20] (03CR) 10Jhedden: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/589716 (https://phabricator.wikimedia.org/T250206) (owner: 10Jhedden) [23:08:11] (03PS1) 10Andrew Bogott: ordered_json.rb: add a new function, verbose_ordered_json [puppet] - 10https://gerrit.wikimedia.org/r/589741 [23:08:13] (03PS1) 10Andrew Bogott: mcrouter: get some newlines in the mcrouter config [puppet] - 10https://gerrit.wikimedia.org/r/589742 [23:08:55] (03PS1) 10Andrew Bogott: mcrouter: update example code [puppet] - 10https://gerrit.wikimedia.org/r/589743 [23:10:23] (03CR) 10jerkins-bot: [V: 04-1] ordered_json.rb: add a new function, verbose_ordered_json [puppet] - 10https://gerrit.wikimedia.org/r/589741 (owner: 10Andrew Bogott) [23:24:48] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is CRITICAL: 122 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37 [23:34:01] (03PS1) 10CRusnov: reports cables: Add extra regexp to support more active interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/589750 [23:34:24] (03CR) 10jerkins-bot: [V: 04-1] reports cables: Add extra regexp to support more active interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/589750 (owner: 10CRusnov)