[00:00:54] RECOVERY - Check systemd state on netflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:01:40] RECOVERY - Check systemd state on netflow3001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:01:46] RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:01:52] RECOVERY - Check systemd state on netflow5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:01:58] RECOVERY - Check systemd state on netflow4001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:14:50] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:15:58] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [00:16:48] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:17:52] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [00:22:58] RECOVERY - Disk space on Hadoop worker on analytics1070 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [00:30:18] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=- method=POST https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [00:32:18] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [02:17:56] RECOVERY - Disk space on Hadoop worker on an-worker1081 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [02:37:40] PROBLEM - Device not healthy -SMART- on heze is CRITICAL: cluster=misc device=megaraid,6 instance=heze job=node site=codfw https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=heze&var-datasource=codfw+prometheus/ops [04:29:58] (03CR) 10Santhosh: [C: 03+1] Publish: Fix broken wikidata linking [extensions/ContentTranslation] (wmf/1.36.0-wmf.5) - 10https://gerrit.wikimedia.org/r/621497 (https://phabricator.wikimedia.org/T249458) (owner: 10KartikMistry) [04:42:58] RECOVERY - Disk space on Hadoop worker on an-worker1084 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [05:27:58] 10Operations, 10ops-codfw, 10DBA, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Marostegui) Thank you! 
I have just uploaded the new version of mariadb (10.4.14) to the repo which has been tested on a couple of servers (codfw and eqiad) for a week [05:29:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1101:3318, db1105:3311 for MCR change', diff saved to https://phabricator.wikimedia.org/P12315 and previous config saved to /var/cache/conftool/dbconfig/20200824-052916-marostegui.json [05:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:30:59] (03PS1) 10Marostegui: db1101,db1105: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/621972 [05:32:04] (03CR) 10Marostegui: [C: 03+2] db1101,db1105: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/621972 (owner: 10Marostegui) [05:51:18] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/621816 (owner: 10Andrew Bogott) [05:51:45] (03PS5) 10Giuseppe Lavagetto: Switch all charts from "stable" to "wmf-stable" [deployment-charts] - 10https://gerrit.wikimedia.org/r/620935 (https://phabricator.wikimedia.org/T258572) [06:13:33] RECOVERY - Disk space on Hadoop worker on an-worker1083 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [06:53:53] RECOVERY - Disk space on Hadoop worker on an-worker1078 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [06:54:01] RECOVERY - Disk space on Hadoop worker on an-worker1094 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [06:54:13] RECOVERY - Disk space on Hadoop worker on an-worker1095 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [06:55:32] 10Operations, 10Wikimedia-Mailing-lists: Disable google code in mailinglists - https://phabricator.wikimedia.org/T261084 (10Aklapper) +1 on disabling (if that exists). Thanks for filing this. 
[07:17:12] (03CR) 10Ema: "Thanks for your patch Ferran! While I think this would work, I'm wondering if we shouldn't just use /run/nagios/nrpe.pid for all OS versio" [puppet] - 10https://gerrit.wikimedia.org/r/621967 (https://phabricator.wikimedia.org/T252990) (owner: 10Southparkfan) [07:19:38] (03CR) 10Muehlenhoff: "Agreed, /run/nagios can simply be used universally across jessie to buster." [puppet] - 10https://gerrit.wikimedia.org/r/621967 (https://phabricator.wikimedia.org/T252990) (owner: 10Southparkfan) [07:22:15] (03PS1) 10Matthias Mullie: Enable MediaSearch A/B test [extensions/WikimediaEvents] (wmf/1.36.0-wmf.5) - 10https://gerrit.wikimedia.org/r/621744 (https://phabricator.wikimedia.org/T254388) [07:32:22] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Marostegui) >>! In T260373#6401735, @jcrespo wrote: > @Papaul (manuel is on vacations until Monday), what about 1 on A1 and 2 on A6? Same row but it lo... [07:36:22] !log push new pfw policies - T261007 [07:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:50] (03CR) 10Jcrespo: "Overally looks ok, but check a few comments that could lead to confusing state after rename." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/621618 (https://phabricator.wikimedia.org/T260843) (owner: 10Bstorm) [07:48:21] (03CR) 10JMeybohm: "> Patch Set 4:" (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/621605 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [08:05:46] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): WDQS Categories reload is failing on thankyouwiki - https://phabricator.wikimedia.org/T261097 (10Gehel) [08:07:05] (03Abandoned) 10Vgutierrez: Update 0006-transaction-timeout.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621533 (https://phabricator.wikimedia.org/T260702) (owner: 10Vgutierrez) [08:07:11] (03Abandoned) 10Vgutierrez: Refresh 0037-force-discard.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621534 (https://phabricator.wikimedia.org/T260702) (owner: 10Vgutierrez) [08:07:20] (03PS2) 10Hashar: Run integration tests on CI [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/621762 (https://phabricator.wikimedia.org/T261098) [08:08:03] (03CR) 10jerkins-bot: [V: 04-1] Run integration tests on CI [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/621762 (https://phabricator.wikimedia.org/T261098) (owner: 10Hashar) [08:08:07] (03PS2) 10Vgutierrez: Remove unnecessary patches for Varnish 6 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621265 (https://phabricator.wikimedia.org/T260702) [08:08:09] (03PS3) 10Vgutierrez: Update 0003-vsm-perms.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621284 (https://phabricator.wikimedia.org/T260702) [08:08:11] (03PS2) 10Vgutierrez: Update 0005-stats-shortlived.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621532 (https://phabricator.wikimedia.org/T260702) [08:08:13] (03PS3) 10Vgutierrez: Update debian/control [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621693 
(https://phabricator.wikimedia.org/T260702) [08:08:15] (03PS2) 10Vgutierrez: Release 6.0.6-1wm1 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621694 (https://phabricator.wikimedia.org/T260702) [08:08:20] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (2020-08-15) rack/setup/install dbprov1003.eqiad.wmnet - https://phabricator.wikimedia.org/T258750 (10jcrespo) >>! In T258750#6404800, @Cmjohnson wrote: > @jcrespo This server is ready for you, I did update the site.pp role to insetup. I didn't want to install it... [08:08:40] (03CR) 10jerkins-bot: [V: 04-1] Update 0003-vsm-perms.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621284 (https://phabricator.wikimedia.org/T260702) (owner: 10Vgutierrez) [08:08:48] (03CR) 10jerkins-bot: [V: 04-1] Update 0005-stats-shortlived.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621532 (https://phabricator.wikimedia.org/T260702) (owner: 10Vgutierrez) [08:08:56] (03CR) 10jerkins-bot: [V: 04-1] Update debian/control [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621693 (https://phabricator.wikimedia.org/T260702) (owner: 10Vgutierrez) [08:09:01] (03CR) 10jerkins-bot: [V: 04-1] Release 6.0.6-1wm1 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621694 (https://phabricator.wikimedia.org/T260702) (owner: 10Vgutierrez) [08:09:55] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): WDQS Categories reload is failing on thankyouwiki - https://phabricator.wikimedia.org/T261097 (10Gehel) p:05Triage→03High [08:24:57] !log installing json-c security updates on buster [08:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:09] (03PS3) 10Hashar: Run integration tests on CI [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/621762 (https://phabricator.wikimedia.org/T261098) [08:26:02] (03CR) 10jerkins-bot: [V: 04-1] Run integration tests on CI [software/wmfmariadbpy] - 
10https://gerrit.wikimedia.org/r/621762 (https://phabricator.wikimedia.org/T261098) (owner: 10Hashar) [08:29:00] (03CR) 10jerkins-bot: [V: 04-1] Remove unnecessary patches for Varnish 6 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621265 (https://phabricator.wikimedia.org/T260702) (owner: 10Vgutierrez) [08:30:23] 10Operations, 10serviceops, 10Platform Team Workboards (Clinic Duty Team): PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10daniel) >>! In T260330#6401750, @tstarling wrote: > I don't know if we really gain much from object encapsulation of files, and it tends... [08:32:20] 10Operations, 10serviceops, 10Platform Team Workboards (Clinic Duty Team): PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10daniel) >>! In T240884#6401731, @tstarling wrote: > OK, I'm adding PHP execution to the service. Am I correct to assume that the PHP ex... [08:33:18] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: Remove legacy conditionals from profile::base::labs [puppet] - 10https://gerrit.wikimedia.org/r/617539 (owner: 10BryanDavis) [08:48:24] (03CR) 10Hashar: "The integration testwmfmariadbpy/test/integration/cli_admin/test_osc_host.py fails equally on my local machine. 
I found a few issues:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/621762 (https://phabricator.wikimedia.org/T261098) (owner: 10Hashar) [08:49:57] (03PS11) 10Jcrespo: mariadb: Setup section->port assignment on puppet [puppet] - 10https://gerrit.wikimedia.org/r/620722 (https://phabricator.wikimedia.org/T257033) [08:49:58] (03PS5) 10Jcrespo: mariadb: Apply the list of ports to the core::multiinstance class [puppet] - 10https://gerrit.wikimedia.org/r/620899 (https://phabricator.wikimedia.org/T257033) [08:52:44] !log depool cp5002 due to icinga errors [08:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:13] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: disable panel html sanitization for grafana-labs [puppet] - 10https://gerrit.wikimedia.org/r/621197 (https://phabricator.wikimedia.org/T259143) (owner: 10Filippo Giunchedi) [08:54:44] !log installing net-snmp security updates on buster [08:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:32] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: export icinga service problems as metrics [puppet] - 10https://gerrit.wikimedia.org/r/620957 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [08:55:35] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: add alerts to 'ops' instance [puppet] - 10https://gerrit.wikimedia.org/r/620956 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [08:56:11] (03PS3) 10Filippo Giunchedi: prometheus: export icinga service problems as metrics [puppet] - 10https://gerrit.wikimedia.org/r/620957 (https://phabricator.wikimedia.org/T258948) [08:58:12] (03PS3) 10Filippo Giunchedi: prometheus: add alerts to 'ops' instance [puppet] - 10https://gerrit.wikimedia.org/r/620956 (https://phabricator.wikimedia.org/T258948) [08:58:25] (03PS1) 10Marostegui: production-m2.sql: Add xhguiadmin user [puppet] - 10https://gerrit.wikimedia.org/r/621985 
(https://phabricator.wikimedia.org/T260640) [09:00:35] !log restart ats-tls on cp5002 [09:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:51] (03CR) 10Marostegui: "After merge, this still requires manual deployment on m2" [puppet] - 10https://gerrit.wikimedia.org/r/621985 (https://phabricator.wikimedia.org/T260640) (owner: 10Marostegui) [09:02:39] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp5002 is OK: HTTP OK: HTTP/1.1 200 Ok - 31845 bytes in 1.211 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:03:31] RECOVERY - Ensure traffic_server is running for instance tls on cp5002 is OK: PROCS OK: 1 process with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:03:55] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp5002 is OK: HTTP OK: HTTP/1.0 200 OK - 23335 bytes in 0.740 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:06:36] (03CR) 10Kormat: [C: 03+1] production-m2.sql: Add xhguiadmin user [puppet] - 10https://gerrit.wikimedia.org/r/621985 (https://phabricator.wikimedia.org/T260640) (owner: 10Marostegui) [09:08:20] <_joe_> !log restarting php-fpm on mw1344 (stuck in SIGILL for new children) [09:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:23] RECOVERY - PHP7 rendering on mw1344 is OK: HTTP OK: HTTP/1.1 302 Found - 643 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:08:39] RECOVERY - Apache HTTP on mw1344 is OK: HTTP OK: HTTP/1.1 302 Found - 629 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:10:48] !log repool cp5002 [09:10:51] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:48] !log installing libx11 security updates on stretch [09:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:12] !log restarting mw canaries to pick up libx11 update [09:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:42] (03CR) 10Marostegui: [C: 03+2] production-m2.sql: Add xhguiadmin user [puppet] - 10https://gerrit.wikimedia.org/r/621985 (https://phabricator.wikimedia.org/T260640) (owner: 10Marostegui) [09:28:32] (03PS1) 10Jcrespo: mariadb-backups: Setup dbprov1003 [puppet] - 10https://gerrit.wikimedia.org/r/621987 (https://phabricator.wikimedia.org/T257551) [09:30:23] (03CR) 10Jcrespo: [C: 04-1] "Requires first dbprov1003 full setup and several database grant updates." [puppet] - 10https://gerrit.wikimedia.org/r/621987 (https://phabricator.wikimedia.org/T257551) (owner: 10Jcrespo) [09:36:10] (03CR) 10Jcrespo: "Double quotes for plain strings is not something I have seen on other WMF python files- in particular the dependency "cumin". 
Talk to Ricc" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/621753 (owner: 10Kormat) [09:43:16] (03PS1) 10ZPapierski: Multiple instances of msearch_daemon [puppet] - 10https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) [09:44:15] (03CR) 10jerkins-bot: [V: 04-1] Multiple instances of msearch_daemon [puppet] - 10https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) (owner: 10ZPapierski) [09:46:45] !log add PNI to CF on cr1-eqiad with import/export NONE - T259036 [09:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:12] 10Operations, 10Puppet, 10Release-Engineering-Team: operations-puppet repo doesn't seem sync'ed with github's - https://phabricator.wikimedia.org/T261105 (10Marostegui) [09:55:57] (03PS4) 10Giuseppe Lavagetto: Test deployments with helmfile lint [deployment-charts] - 10https://gerrit.wikimedia.org/r/620934 (https://phabricator.wikimedia.org/T258572) [09:57:55] (03CR) 10jerkins-bot: [V: 04-1] Test deployments with helmfile lint [deployment-charts] - 10https://gerrit.wikimedia.org/r/620934 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [09:58:10] !log oblivian@cumin1001 conftool action : set/weight=1; selector: dc=codfw,cluster=appserver,service=canary [09:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:50] 10Operations, 10SRE-Access-Requests: Request for access to analytics-privatedata-users - https://phabricator.wikimedia.org/T260450 (10Cparle) Wikitech username: Cparle Preferred shell username: cparle Email address: cparle@wikimedia.org Ssh public key (must be dedicated key for wmf production)... 
[10:06:00] (03PS2) 10ZPapierski: Multiple instances of msearch_daemon [puppet] - 10https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) [10:12:45] !log installing firejail security updates on mw canaries [10:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:08] (03PS1) 10VulpesVulpes825: Change Chinese Wikisource Logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622112 (https://phabricator.wikimedia.org/T261076) [10:30:04] jan_drewniak: #bothumor My software never has bugs. It just develops random features. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200824T1030). [10:35:55] (03CR) 10ZPapierski: "Cron worked correctly today - while I still don't know why there was an issue last week, cron works correctly." [puppet] - 10https://gerrit.wikimedia.org/r/619289 (https://phabricator.wikimedia.org/T251515) (owner: 10ZPapierski) [10:39:42] (03Abandoned) 10VulpesVulpes825: Change Chinese Wikisource Logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622112 (https://phabricator.wikimedia.org/T261076) (owner: 10VulpesVulpes825) [10:42:56] (03PS1) 10VulpesVulpes825: Change Chinese Wikisource Logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622114 (https://phabricator.wikimedia.org/T261076) [10:43:43] !log installing ruby2.3 security updates [10:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:22] 10Operations: Integrate Stretch 9.13 point update - https://phabricator.wikimedia.org/T258407 (10MoritzMuehlenhoff) [10:44:34] 10Operations: Integrate Stretch 9.13 point update - https://phabricator.wikimedia.org/T258407 (10MoritzMuehlenhoff) [10:45:22] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:622116| Bumping portals to master (T128546)]] (duration: 01m 00s) [10:45:25] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:26] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [10:46:21] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:622116| Bumping portals to master (T128546)]] (duration: 00m 58s) [10:46:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:14] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622116 (https://phabricator.wikimedia.org/T128546) [10:47:16] (03CR) 10Zoranzoki21: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622114 (https://phabricator.wikimedia.org/T261076) (owner: 10VulpesVulpes825) [10:47:20] (03CR) 10jerkins-bot: [V: 04-1] Change Chinese Wikisource Logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622114 (https://phabricator.wikimedia.org/T261076) (owner: 10VulpesVulpes825) [10:47:24] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622116 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:47:26] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622116 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:50:13] (03PS1) 10Giuseppe Lavagetto: Remove the original X-Forwarded-Proto header if injecting https [deployment-charts] - 10https://gerrit.wikimedia.org/r/622118 [10:52:24] (03PS1) 10VulpesVulpes825: Update Classical Chinese Wikipedia workmark and [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622119 (https://phabricator.wikimedia.org/T261110) [10:52:51] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is CRITICAL: 104.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs 
https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37 [10:53:25] (03PS6) 10Jbond: ci - flake8: update flake8 rules to be compatible with black [puppet] - 10https://gerrit.wikimedia.org/r/554827 (https://phabricator.wikimedia.org/T211750) [10:53:30] (03PS8) 10Jbond: CI - black: update python3 files with black [puppet] - 10https://gerrit.wikimedia.org/r/554825 (https://phabricator.wikimedia.org/T211750) [10:53:36] (03PS9) 10Jbond: CI - black: run black over python2 files [puppet] - 10https://gerrit.wikimedia.org/r/554826 (https://phabricator.wikimedia.org/T211750) [10:53:57] (03CR) 10jerkins-bot: [V: 04-1] ci - flake8: update flake8 rules to be compatible with black [puppet] - 10https://gerrit.wikimedia.org/r/554827 (https://phabricator.wikimedia.org/T211750) (owner: 10Jbond) [10:53:58] (03CR) 10jerkins-bot: [V: 04-1] CI - black: update python3 files with black [puppet] - 10https://gerrit.wikimedia.org/r/554825 (https://phabricator.wikimedia.org/T211750) (owner: 10Jbond) [10:54:18] (03CR) 10jerkins-bot: [V: 04-1] CI - black: run black over python2 files [puppet] - 10https://gerrit.wikimedia.org/r/554826 (https://phabricator.wikimedia.org/T211750) (owner: 10Jbond) [10:56:57] (03PS16) 10Jbond: CI - taskgen: add black tests for python2 and python3 files [puppet] - 10https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) [10:57:23] (03CR) 10jerkins-bot: [V: 04-1] CI - taskgen: add black tests for python2 and python3 files [puppet] - 10https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) (owner: 10Jbond) [10:58:24] (03PS2) 10VulpesVulpes825: Change Chinese Wikisource Logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622114 (https://phabricator.wikimedia.org/T261076) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor My software never has bugs. It just develops random features. 
Rise for European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200824T1100). [11:00:04] VulpesVulpes825, kart_, and matthiasmullie: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:13] I can deploy today! [11:00:20] o/ [11:01:01] Thanks! [11:01:13] (03CR) 10Urbanecm: [C: 03+2] Enable MediaSearch A/B test [extensions/WikimediaEvents] (wmf/1.36.0-wmf.5) - 10https://gerrit.wikimedia.org/r/621744 (https://phabricator.wikimedia.org/T254388) (owner: 10Matthias Mullie) [11:01:40] matthiasmullie: kart_: +2'ed your backports to save time on CI, going to do the config patches [11:01:41] Urbanecm: Great! [11:01:43] VulpesVulpes825: around? :) [11:01:55] (03CR) 10Urbanecm: [C: 03+2] Publish: Fix broken wikidata linking [extensions/ContentTranslation] (wmf/1.36.0-wmf.5) - 10https://gerrit.wikimedia.org/r/621497 (https://phabricator.wikimedia.org/T249458) (owner: 10KartikMistry) [11:02:04] Urbanecm: cool thanks [11:02:11] matthiasmullie: thanks!
[11:02:39] Sorry, I was in another channel and wondering why there is no msg from jouncebot yet ;) [11:03:38] :) [11:03:51] (03CR) 10Urbanecm: [C: 03+2] Correct the wrong workmark and tagline for Chinese Wikimedia Project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621542 (https://phabricator.wikimedia.org/T260908) (owner: 10VulpesVulpes825) [11:03:53] (03CR) 10Urbanecm: [C: 03+2] Change Chinese Wikisource Logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622114 (https://phabricator.wikimedia.org/T261076) (owner: 10VulpesVulpes825) [11:03:55] (03CR) 10Urbanecm: [C: 03+2] Update Classical Chinese Wikipedia workmark and [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622119 (https://phabricator.wikimedia.org/T261110) (owner: 10VulpesVulpes825) [11:03:58] (03Merged) 10jenkins-bot: Correct the wrong workmark and tagline for Chinese Wikimedia Project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621542 (https://phabricator.wikimedia.org/T260908) (owner: 10VulpesVulpes825) [11:04:06] (03Merged) 10jenkins-bot: Enable MediaSearch A/B test [extensions/WikimediaEvents] (wmf/1.36.0-wmf.5) - 10https://gerrit.wikimedia.org/r/621744 (https://phabricator.wikimedia.org/T254388) (owner: 10Matthias Mullie) [11:04:29] (03PS3) 10Urbanecm: Change Chinese Wikisource Logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622114 (https://phabricator.wikimedia.org/T261076) (owner: 10VulpesVulpes825) [11:04:32] (03PS2) 10Urbanecm: Update Classical Chinese Wikipedia workmark and [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622119 (https://phabricator.wikimedia.org/T261110) (owner: 10VulpesVulpes825) [11:04:36] (03CR) 10Urbanecm: Change Chinese Wikisource Logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622114 (https://phabricator.wikimedia.org/T261076) (owner: 10VulpesVulpes825) [11:04:40] (03CR) 10Urbanecm: [C: 03+2] Change Chinese Wikisource Logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622114 
(https://phabricator.wikimedia.org/T261076) (owner: 10VulpesVulpes825) [11:04:44] (03CR) 10Urbanecm: Update Classical Chinese Wikipedia workmark and [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622119 (https://phabricator.wikimedia.org/T261110) (owner: 10VulpesVulpes825) [11:04:49] (03CR) 10Urbanecm: [C: 03+2] Update Classical Chinese Wikipedia workmark and [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622119 (https://phabricator.wikimedia.org/T261110) (owner: 10VulpesVulpes825) [11:05:36] (03Merged) 10jenkins-bot: Change Chinese Wikisource Logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622114 (https://phabricator.wikimedia.org/T261076) (owner: 10VulpesVulpes825) [11:05:53] VulpesVulpes825: I'm going to pull them to mwdebug all at once, if that's okay with you [11:06:00] It's okay [11:06:27] (03PS3) 10Urbanecm: Update Classical Chinese Wikipedia workmark and [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622119 (https://phabricator.wikimedia.org/T261110) (owner: 10VulpesVulpes825) [11:06:32] (03CR) 10Urbanecm: Update Classical Chinese Wikipedia workmark and [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622119 (https://phabricator.wikimedia.org/T261110) (owner: 10VulpesVulpes825) [11:06:36] (03CR) 10Urbanecm: [C: 03+2] Update Classical Chinese Wikipedia workmark and [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622119 (https://phabricator.wikimedia.org/T261110) (owner: 10VulpesVulpes825) [11:07:25] (03Merged) 10jenkins-bot: Update Classical Chinese Wikipedia workmark and [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622119 (https://phabricator.wikimedia.org/T261110) (owner: 10VulpesVulpes825) [11:08:16] Urbanecm: But please allow me a little more time for checking, since I need to check all Chinese Wikimedia Projects.
[11:08:29] sure [11:08:37] VulpesVulpes825: pulled onto mwdebug1002 [11:08:42] Got it [11:09:23] (03PS9) 10Ema: cache: remove '_ats' suffix from DC names [puppet] - 10https://gerrit.wikimedia.org/r/618975 (https://phabricator.wikimedia.org/T241239) [11:11:28] (03PS3) 10Hnowlan: api-gateway: strip cookie headers from requests and responses. [deployment-charts] - 10https://gerrit.wikimedia.org/r/620311 (https://phabricator.wikimedia.org/T259296) [11:12:13] (03CR) 10Hnowlan: api-gateway: strip cookie headers from requests and responses. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/620311 (https://phabricator.wikimedia.org/T259296) (owner: 10Hnowlan) [11:13:24] Urbanecm: The Classical Chinese Wikipedia wordmark and tagline update is successful. But I did not see the other two patches changes. [11:13:47] let me see [11:15:30] VulpesVulpes825: I confirm https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/622114 is at mwdebug1002 [11:15:59] 10Operations, 10serviceops, 10Platform Team Workboards (Clinic Duty Team): PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10Joe) >>! In T260330#6401694, @tstarling wrote: >> * The service will be firewalled from all network access. We might consider adding sp... [11:16:03] can you confirm you test via the correct mwdebug server (mwdebug1002 rather than 1001 I usually use)? 
[11:17:14] Urbanecm: Interesting, the wordmark for patch 622114 works, but the project logo does not [11:17:23] (03CR) 10Hnowlan: [C: 03+2] Include certs into annotations for api-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/621716 (owner: 10Ppchelko) [11:17:57] (03CR) 10Hnowlan: [C: 03+2] Add jwt and ratelimiter fixtures to gateway for more validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/620992 (owner: 10Ppchelko) [11:18:03] (03CR) 10Ema: [C: 03+2] cache: remove '_ats' suffix from DC names [puppet] - 10https://gerrit.wikimedia.org/r/618975 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [11:18:15] VulpesVulpes825: I can see the updated logo as well. Can you purge your browser cache or test from an incognito window, please? [11:18:28] (03Merged) 10jenkins-bot: Include certs into annotations for api-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/621716 (owner: 10Ppchelko) [11:19:07] (03Merged) 10jenkins-bot: Add jwt and ratelimiter fixtures to gateway for more validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/620992 (owner: 10Ppchelko) [11:20:06] (03PS1) 10Marostegui: mariadb: Move db1128 to m5 [puppet] - 10https://gerrit.wikimedia.org/r/622120 (https://phabricator.wikimedia.org/T260324) [11:20:29] Urbanecm: I did both, but I cannot see the logo change on my end. [11:21:21] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [11:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:23] (03PS1) 10Cmjohnson: updating partman recipe for pki1001 [puppet] - 10https://gerrit.wikimedia.org/r/622121 (https://phabricator.wikimedia.org/T259826) [11:21:44] 10Operations, 10serviceops, 10Platform Team Workboards (Clinic Duty Team): PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10tstarling) >>! In T260330#6405301, @daniel wrote: >>>!
In T240884#6401731, @tstarling wrote: >> OK, I'm adding PHP execution to the serv... [11:22:00] VulpesVulpes825: does https://en.wikipedia.org/static/images/project-logos/zhwikisource.png display the correct logo? [11:22:04] (03CR) 10Cmjohnson: [C: 03+2] updating partman recipe for pki1001 [puppet] - 10https://gerrit.wikimedia.org/r/622121 (https://phabricator.wikimedia.org/T259826) (owner: 10Cmjohnson) [11:22:21] Urbanecm: That one is correct [11:22:33] and where does it display the incorrect logo, VulpesVulpes825 ? [11:22:45] On the Chinese Wikisource mainpage [11:23:10] (03Merged) 10jenkins-bot: Publish: Fix broken wikidata linking [extensions/ContentTranslation] (wmf/1.36.0-wmf.5) - 10https://gerrit.wikimedia.org/r/621497 (https://phabricator.wikimedia.org/T249458) (owner: 10KartikMistry) [11:23:17] you mean, https://zh.wikisource.org/wiki/Wikisource:%E9%A6%96%E9%A1%B5? [11:23:37] That is correct [11:23:52] So https://zh.wikisource.org/static/images/project-logos/zhwikisource-2x.png gives the old logo, but not https://en.wikipedia.org/static/images/project-logos/zhwikisource.png [11:24:00] aha!" [11:24:27] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By:2020-08-17) label/setup/install pki1001 - https://phabricator.wikimedia.org/T259826 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` pki1001.eqiad.wmnet ` The log can be found in... [11:24:58] VulpesVulpes825: this is the 2x file on my end, is that correct? https://usercontent.irccloud-cdn.com/file/bb9aeHpV/image.png [11:25:11] That is correct [11:25:30] hmm, so it seems your computer's cache is broken. I'd call it working then [11:25:40] what about the rest of the changes? do they work? [11:26:20] The rest are all working [11:26:33] okay, so I'm going to sync them all [11:26:47] Great. 
[11:29:10] !log urbanecm@deploy1001 Synchronized static/images/: fe0449d244ee876e4fb64da630f0994ab114f248: 74220d0943e6b32cce3c93dd5b9f8bbc63fa5d73: 7db8a19c512cea84f3000463e9dfb6617857c9a6: Update Chinese wordmarks and taglines, update zhwikisource project logo (T260908; T258552; T261076; T261110) (duration: 00m 58s) [11:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:17] T261076: Change Chinese Wikisource Logo - https://phabricator.wikimedia.org/T261076 [11:29:17] T258552: Add wordmarks and taglines for 26 more Wikipedias - https://phabricator.wikimedia.org/T258552 [11:29:18] T261110: Update Classical Chinese Wikipedia workmark and - https://phabricator.wikimedia.org/T261110 [11:29:18] T260908: Wrong Wordmark and Tagline for Chinese Wikimedia Project - https://phabricator.wikimedia.org/T260908 [11:30:47] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: fe0449d244ee876e4fb64da630f0994ab114f248: 74220d0943e6b32cce3c93dd5b9f8bbc63fa5d73: 7db8a19c512cea84f3000463e9dfb6617857c9a6: Update Chinese wordmarks and taglines, update zhwikisource project logo (T260908; T258552; T261076; T261110) (duration: 00m 59s) [11:30:55] VulpesVulpes825: done [11:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:11] kart_: I'm done. Do you want to self-service, or should I deploy that? [11:31:31] Urbanecm: Thank you so much, and sorry for my broken cache. [11:31:46] no problem, that can happen :) [11:32:09] !log add liblept5 1.76.0-1~bpo9+1 (and leptonica-progs) to stretch-wikimedia/component/tesseract-410-bpo (T247422) [11:32:12] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . 
[11:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:13] T247422: Update Tesseract on Toolforge to v4.1.0 - https://phabricator.wikimedia.org/T247422 [11:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:20] Urbanecm: can you deploy it? [11:33:42] I'm preparing test article. [11:34:28] kart_: sure [11:34:52] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [11:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:23] kart_: pulled onto mwdebug1002, please test [11:36:53] matthiasmullie: I pulled your backport there as well, as it is also merged. [11:37:12] thanks, will check [11:37:58] Urbanecm: not sure, if it will work with mwdebug, but let me try. [11:38:12] happy to deploy in case it doesn't :) [11:39:06] !log Purge 13 URLs with purgeList.php, see P12316 for list of them (T260908; T258552; T261076; T261110) [11:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:12] T261076: Change Chinese Wikisource Logo - https://phabricator.wikimedia.org/T261076 [11:39:12] T258552: Add wordmarks and taglines for 26 more Wikipedias - https://phabricator.wikimedia.org/T258552 [11:39:13] T261110: Update Classical Chinese Wikipedia workmark and - https://phabricator.wikimedia.org/T261110 [11:39:13] T260908: Wrong Wordmark and Tagline for Chinese Wikimedia Project - https://phabricator.wikimedia.org/T260908 [11:39:23] 10Operations, 10Puppet, 10Release-Engineering-Team: operations-puppet repo doesn't seem sync'ed with github's - https://phabricator.wikimedia.org/T261105 (10Marostegui) p:05Triage→03Medium [11:40:09] Urbanecm: worked :D [11:40:12] Urbanecm: go ahead. 
[11:40:17] okay, syncing then [11:41:54] Urbanecm: should be ok to sync [11:41:58] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.5/extensions/ContentTranslation/modules/publish/ext.cx.wikibase.link.js: 74a87184408937bcdb4a27f1f563bbbdff45cf97: Publish: Fix broken wikidata linking (T249458) (duration: 00m 58s) [11:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:02] T249458: ContentTranslation is not adding pages sitelinks to wikidata - https://phabricator.wikimedia.org/T249458 [11:42:21] matthiasmullie: thanks, will do [11:43:31] Urbanecm: thanks a lot! [11:43:37] happy to help kart_ ! [11:43:50] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.5/extensions/WikimediaEvents/modules/ext.wikimediaEvents/searchSatisfaction.js: 1066ecbe2836e69211c905f597ad6b62241528c0: Enable MediaSearch A/B test (T254388) (duration: 00m 56s) [11:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:55] T254388: Create interleaved A/B test for searching using new commons-specific elasticsearch query builder - https://phabricator.wikimedia.org/T254388 [11:43:56] musikanimal: should be done! [11:44:04] anything else? [11:44:19] matthiasmullie: ^ sorry, wrong ping [11:44:28] Urbanecm: thanks! [11:44:42] happy to help! [11:47:30] (03PS1) 10Urbanecm: Enable mapframe at trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622124 (https://phabricator.wikimedia.org/T260594) [11:47:58] (03CR) 10Urbanecm: [C: 03+2] Enable mapframe at trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622124 (https://phabricator.wikimedia.org/T260594) (owner: 10Urbanecm) [11:48:39] (03CR) 10JMeybohm: "LGTM but do we need to bumb some charts to have them include this?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/622118 (owner: 10Giuseppe Lavagetto) [11:48:41] 10Operations: Integrate Stretch 9.13 point update - https://phabricator.wikimedia.org/T258407 (10MoritzMuehlenhoff) [11:48:42] (03Merged) 10jenkins-bot: Enable mapframe at trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622124 (https://phabricator.wikimedia.org/T260594) (owner: 10Urbanecm) [11:50:35] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: e1ae39afbb4d6f33e74782580db7dfee06d0097d: Enable mapframe at trwiki (T260594) (duration: 00m 58s) [11:50:36] (03PS1) 10Cmjohnson: Add an-test-workers dhcp file [puppet] - 10https://gerrit.wikimedia.org/r/622125 (https://phabricator.wikimedia.org/T255520) [11:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:38] T260594: Enable mapframe tag for trwiki - https://phabricator.wikimedia.org/T260594 [11:51:04] (03CR) 10jerkins-bot: [V: 04-1] Add an-test-workers dhcp file [puppet] - 10https://gerrit.wikimedia.org/r/622125 (https://phabricator.wikimedia.org/T255520) (owner: 10Cmjohnson) [11:53:55] (03PS2) 10Cmjohnson: Add an-test-workers dhcp file [puppet] - 10https://gerrit.wikimedia.org/r/622125 (https://phabricator.wikimedia.org/T255520) [11:54:23] (03CR) 10jerkins-bot: [V: 04-1] Add an-test-workers dhcp file [puppet] - 10https://gerrit.wikimedia.org/r/622125 (https://phabricator.wikimedia.org/T255520) (owner: 10Cmjohnson) [11:54:25] (03PS1) 10Urbanecm: Add retrobibliothek.de to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622126 (https://phabricator.wikimedia.org/T261012) [11:54:49] (03CR) 10Urbanecm: [C: 03+2] Add retrobibliothek.de to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622126 (https://phabricator.wikimedia.org/T261012) (owner: 10Urbanecm) [11:54:54] (03PS1) 10Ayounsi: Add "preferred" to primary when more than 
2 IPs configured [homer/public] - 10https://gerrit.wikimedia.org/r/622127 [11:55:09] PROBLEM - Host ganeti5002 is DOWN: PING CRITICAL - Packet loss = 100% [11:55:17] (03Abandoned) 10Ema: ATS: add caching rule for thanos-query.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/617420 (https://phabricator.wikimedia.org/T151009) (owner: 10Ema) [11:55:53] (03Merged) 10jenkins-bot: Add retrobibliothek.de to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622126 (https://phabricator.wikimedia.org/T261012) (owner: 10Urbanecm) [11:55:57] (03PS3) 10Urbanecm: Enable tewiki as import source for tewikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619371 (https://phabricator.wikimedia.org/T260107) (owner: 10Jayprakash12345) [11:56:01] PROBLEM - Host prometheus5001 is DOWN: PING CRITICAL - Packet loss = 100% [11:56:01] (03CR) 10Urbanecm: [C: 03+2] "per Jay's request" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619371 (https://phabricator.wikimedia.org/T260107) (owner: 10Jayprakash12345) [11:56:10] (03PS3) 10Cmjohnson: Add an-test-workers dhcp file [puppet] - 10https://gerrit.wikimedia.org/r/622125 (https://phabricator.wikimedia.org/T255520) [11:56:57] (03Merged) 10jenkins-bot: Enable tewiki as import source for tewikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619371 (https://phabricator.wikimedia.org/T260107) (owner: 10Jayprakash12345) [11:57:10] (03CR) 10Cmjohnson: [C: 03+2] Add an-test-workers dhcp file [puppet] - 10https://gerrit.wikimedia.org/r/622125 (https://phabricator.wikimedia.org/T255520) (owner: 10Cmjohnson) [11:57:34] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 5a6d025b04eb20787e8abbbdd56a3abb3818b82f: Add retrobibliothek.de to the wgCopyUploadsDomains allowlist of Wikimedia Commons (T261012) (duration: 00m 56s) [11:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:38] T261012: Add 
retrobibliothek.de to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T261012 [11:58:32] !log test advertise CF tunnel endpoint on cr1-eqiad - T259036 [11:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:50] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-17) rack/setup/install - https://phabricator.wikimedia.org/T255520 (10Cmjohnson) [11:58:53] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=thanos-sidecar site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:59:41] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-17) rack/setup/install - https://phabricator.wikimedia.org/T255520 (10Cmjohnson) @elukey I do not know what partman recipe you need for these. Need to update that info and enable the ports and the servers a... [12:00:52] 10Operations, 10Gerrit, 10Wikimedia-GitHub, 10Release-Engineering-Team (Development services), 10Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)): operations-puppet repo doesn't seem sync'ed with github's - https://phabricator.wikimedia.org/T261105 (10hashar) p:05Medium→03Unbreak! All re... 
[12:01:10] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 8c380d65d760591099c296ae522b2e63953413aa: Enable tewiki as import source for tewikibooks (T260107) (duration: 00m 57s) [12:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:14] T260107: Enable Telugu wikipedia as Import source for Telugu wikibooks - https://phabricator.wikimedia.org/T260107 [12:01:22] !log EU B&C window completed [12:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1105:3311', diff saved to https://phabricator.wikimedia.org/P12317 and previous config saved to /var/cache/conftool/dbconfig/20200824-120310-marostegui.json [12:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:12] (03PS1) 10Marostegui: db1105: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/622128 [12:03:39] (03CR) 10Marostegui: [C: 03+2] db1105: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/622128 (owner: 10Marostegui) [12:05:00] (03PS1) 10Urbanecm: admin: Update urbanecm's default home - add .vimrc [puppet] - 10https://gerrit.wikimedia.org/r/622129 [12:05:05] (03PS1) 10Cmjohnson: Adding an-test-workers to site.pp insetup role [puppet] - 10https://gerrit.wikimedia.org/r/622130 (https://phabricator.wikimedia.org/T255520) [12:05:54] 10Operations, 10Gerrit, 10Wikimedia-GitHub, 10Release-Engineering-Team (Development services), 10Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)): operations-puppet repo doesn't seem sync'ed with github's - https://phabricator.wikimedia.org/T261105 (10hashar) 05Open→03Resolved a:03hasha... 
[12:06:24] (03CR) 10Cmjohnson: [C: 03+2] Adding an-test-workers to site.pp insetup role [puppet] - 10https://gerrit.wikimedia.org/r/622130 (https://phabricator.wikimedia.org/T255520) (owner: 10Cmjohnson) [12:06:58] 10Operations, 10serviceops, 10Platform Team Workboards (Clinic Duty Team): PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10daniel) > The "call" action also requires a list of PHP input files, so that's how you define the function you're calling. So the PHP c... [12:08:45] 10Operations, 10Gerrit, 10Wikimedia-GitHub, 10Release-Engineering-Team (Development services), 10Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)): operations-puppet repo doesn't seem sync'ed with github's - https://phabricator.wikimedia.org/T261105 (10hashar) See past occurrence: {T240322} [12:10:53] (03PS1) 10Ema: cache: remove 'backend_services' hiera setting [puppet] - 10https://gerrit.wikimedia.org/r/622131 (https://phabricator.wikimedia.org/T222937) [12:11:38] (03PS3) 10JMeybohm: sre.discovery: Refactor [cookbooks] - 10https://gerrit.wikimedia.org/r/621721 (https://phabricator.wikimedia.org/T260663) [12:12:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1105:3311', diff saved to https://phabricator.wikimedia.org/P12318 and previous config saved to /var/cache/conftool/dbconfig/20200824-121200-marostegui.json [12:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:09] (03CR) 10jerkins-bot: [V: 04-1] sre.discovery: Refactor [cookbooks] - 10https://gerrit.wikimedia.org/r/621721 (https://phabricator.wikimedia.org/T260663) (owner: 10JMeybohm) [12:14:12] (03PS2) 10Ema: cache: remove 'backend_services' hiera setting [puppet] - 10https://gerrit.wikimedia.org/r/622131 (https://phabricator.wikimedia.org/T222937) [12:16:24] 10Operations, 10SRE-tools, 10serviceops, 10Patch-For-Review: Create a cookbook for depooling one or all services from one kubernetes cluster - 
https://phabricator.wikimedia.org/T260663 (10JMeybohm) As this is not k8s specific I decided to refactor sre.discovery instead of generating a new cookbook. We can... [12:16:40] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/622131 (https://phabricator.wikimedia.org/T222937) (owner: 10Ema) [12:19:08] (03PS4) 10JMeybohm: sre.discovery: Refactor [cookbooks] - 10https://gerrit.wikimedia.org/r/621721 (https://phabricator.wikimedia.org/T260663) [12:20:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1105:3311', diff saved to https://phabricator.wikimedia.org/P12319 and previous config saved to /var/cache/conftool/dbconfig/20200824-122050-marostegui.json [12:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:53] (03CR) 10JMeybohm: "This change is ready for review." [cookbooks] - 10https://gerrit.wikimedia.org/r/621721 (https://phabricator.wikimedia.org/T260663) (owner: 10JMeybohm) [12:27:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1105:3311 after MCR change', diff saved to https://phabricator.wikimedia.org/P12320 and previous config saved to /var/cache/conftool/dbconfig/20200824-122752-marostegui.json [12:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1084 for MCR change', diff saved to https://phabricator.wikimedia.org/P12321 and previous config saved to /var/cache/conftool/dbconfig/20200824-122848-marostegui.json [12:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:23] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By:2020-08-17) label/setup/install pki1001 - https://phabricator.wikimedia.org/T259826 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['pki1001.eqiad.wmnet'] ` Of which those **FAILED**: ` ['pki1001.eqiad.wmnet'] ` [12:29:37] (03PS1) 10Cmjohnson: Adding an-test-coord/test-masters to dhcp file 
[puppet] - 10https://gerrit.wikimedia.org/r/622132 (https://phabricator.wikimedia.org/T255518) [12:29:54] (03PS1) 10Marostegui: db1084: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/622133 [12:30:29] (03CR) 10Marostegui: [C: 03+2] db1084: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/622133 (owner: 10Marostegui) [12:30:37] (03PS3) 10Southparkfan: nagios-nrpe-server systemd unit: use /run for PID files [puppet] - 10https://gerrit.wikimedia.org/r/621967 (https://phabricator.wikimedia.org/T252990) [12:33:15] (03PS1) 10Cmjohnson: Adding an-test-masters/test-coord1001 to site.pp role insetup [puppet] - 10https://gerrit.wikimedia.org/r/622135 (https://phabricator.wikimedia.org/T255518) [12:34:11] (03CR) 10Cmjohnson: [C: 03+2] Adding an-test-coord/test-masters to dhcp file [puppet] - 10https://gerrit.wikimedia.org/r/622132 (https://phabricator.wikimedia.org/T255518) (owner: 10Cmjohnson) [12:34:21] (03PS2) 10Cmjohnson: Adding an-test-coord/test-masters to dhcp file [puppet] - 10https://gerrit.wikimedia.org/r/622132 (https://phabricator.wikimedia.org/T255518) [12:34:30] (03CR) 10Cmjohnson: [V: 03+2 C: 03+2] Adding an-test-coord/test-masters to dhcp file [puppet] - 10https://gerrit.wikimedia.org/r/622132 (https://phabricator.wikimedia.org/T255518) (owner: 10Cmjohnson) [12:34:54] (03PS2) 10Cmjohnson: Adding an-test-masters/test-coord1001 to site.pp role insetup [puppet] - 10https://gerrit.wikimedia.org/r/622135 (https://phabricator.wikimedia.org/T255518) [12:36:07] (03CR) 10Cmjohnson: [C: 03+2] Adding an-test-masters/test-coord1001 to site.pp role insetup [puppet] - 10https://gerrit.wikimedia.org/r/622135 (https://phabricator.wikimedia.org/T255518) (owner: 10Cmjohnson) [12:36:31] (03CR) 10Jbond: cumin: for new wmcs. 
prefix for cookbooks, grant access to wmcs-admins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/621343 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [12:39:29] 10Operations, 10serviceops, 10Platform Team Workboards (Clinic Duty Team): PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10JMeybohm) >>! In T260330#6405691, @Joe wrote: > My current idea is that we will run these "nanoservices" as normal kubernetes services,... [12:39:33] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-02) rack/setup/install 3 lightweight hadoop nodes - https://phabricator.wikimedia.org/T255518 (10Cmjohnson) [12:40:38] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-02) rack/setup/install 3 lightweight hadoop nodes - https://phabricator.wikimedia.org/T255518 (10Cmjohnson) @elukey The same thing for these, I need to know what partman recipe you want or feel free to add them yourself. Once that... 
[12:44:10] (03PS1) 10Vgutierrez: ATS: Disable ECDHE-ECDSA-AES128-SHA support [puppet] - 10https://gerrit.wikimedia.org/r/622138 (https://phabricator.wikimedia.org/T258405) [12:50:50] (03PS1) 10Marostegui: db1101: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/622139 [12:51:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1101:3318', diff saved to https://phabricator.wikimedia.org/P12322 and previous config saved to /var/cache/conftool/dbconfig/20200824-125131-marostegui.json [12:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:37] (03CR) 10Marostegui: [C: 03+2] db1101: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/622139 (owner: 10Marostegui) [12:56:13] (03CR) 10Kormat: mariadb: Move db1128 to m5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/622120 (https://phabricator.wikimedia.org/T260324) (owner: 10Marostegui) [12:56:17] 10Operations, 10observability: Grafana/Thanos serves 503s for long-time-window requests - https://phabricator.wikimedia.org/T260241 (10fgiunchedi) p:05High→03Medium Lowering priority as the main issue has been mitigated, still work to do though to improve performance [12:57:13] (03CR) 10Vgutierrez: [C: 03+1] cache: remove 'backend_services' hiera setting [puppet] - 10https://gerrit.wikimedia.org/r/622131 (https://phabricator.wikimedia.org/T222937) (owner: 10Ema) [12:58:53] (03CR) 10Krinkle: [C: 03+1] "lgtm, I wonder what we do for such cnf files elsewhere. 
I'd expect it to either be dynamic/merge/append, or for it to live in the role cla" [puppet] - 10https://gerrit.wikimedia.org/r/621100 (https://phabricator.wikimedia.org/T260640) (owner: 10Dave Pifke) [12:59:57] (03CR) 10Marostegui: mariadb: Move db1128 to m5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/622120 (https://phabricator.wikimedia.org/T260324) (owner: 10Marostegui) [13:00:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1101:3318', diff saved to https://phabricator.wikimedia.org/P12323 and previous config saved to /var/cache/conftool/dbconfig/20200824-130024-marostegui.json [13:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:30] (03CR) 10Kormat: [C: 03+1] mariadb: Move db1128 to m5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/622120 (https://phabricator.wikimedia.org/T260324) (owner: 10Marostegui) [13:05:37] !log installing imagemagick security updates on stretch [13:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:05] (03PS2) 10Marostegui: mariadb: Move db1128 to m5 [puppet] - 10https://gerrit.wikimedia.org/r/622120 (https://phabricator.wikimedia.org/T260324) [13:08:21] (03CR) 10Giuseppe Lavagetto: [C: 03+2] confd: disable -watch for machines connected to etcd 3.x [puppet] - 10https://gerrit.wikimedia.org/r/621484 (https://phabricator.wikimedia.org/T260889) (owner: 10Giuseppe Lavagetto) [13:08:57] (03CR) 10Marostegui: [C: 03+2] mariadb: Move db1128 to m5 [puppet] - 10https://gerrit.wikimedia.org/r/622120 (https://phabricator.wikimedia.org/T260324) (owner: 10Marostegui) [13:13:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1101:3318', diff saved to https://phabricator.wikimedia.org/P12325 and previous config saved to /var/cache/conftool/dbconfig/20200824-131305-marostegui.json [13:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:02] 10Operations, 10Discovery-Search 
(Current work): wdqs1009 has puppet changes on each run - https://phabricator.wikimedia.org/T260123 (10dcausse) The `wdqs-updater` service is configured with: ` ConditionPathExists=<%= @data_dir %>/wikidata.jnl ConditionPathExists=<%= @data_dir %>/data_loaded ` if the data-rel... [13:25:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [13:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:02] (03CR) 10Ema: [C: 03+1] ATS: Disable ECDHE-ECDSA-AES128-SHA support [puppet] - 10https://gerrit.wikimedia.org/r/622138 (https://phabricator.wikimedia.org/T258405) (owner: 10Vgutierrez) [13:27:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:25] (03PS3) 10Kormat: Add 'black' formatter support. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/621724 [13:28:32] (03PS3) 10Kormat: Run 'black' against setup.py and wmfmariadbpy [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/621753 [13:28:34] (03PS3) 10Kormat: Run 'black' in CI against wmfmariadbpy. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/621754 [13:30:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1101:3318 after MCR change', diff saved to https://phabricator.wikimedia.org/P12326 and previous config saved to /var/cache/conftool/dbconfig/20200824-133032-marostegui.json [13:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:26] (03CR) 10JMeybohm: [C: 03+1] "just a nit" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/620934 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [13:31:29] (03CR) 10Jbond: "it just got pointed out to me that historicly only global roots are allowed to login to the cumin hosts. 
this CR updates that policy as s" [puppet] - 10https://gerrit.wikimedia.org/r/621343 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [13:31:34] (03CR) 10Jbond: [C: 04-1] cumin: for new wmcs. prefix for cookbooks, grant access to wmcs-admins [puppet] - 10https://gerrit.wikimedia.org/r/621343 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [13:35:01] (03CR) 10JMeybohm: [C: 04-1] helmfile: add values for staging environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/621605 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [13:35:15] (03CR) 10Volans: "I'd like to have Moritz comment on this. As of now AFAIK the cumin hosts are considered 'ops' only and as such might contain sensitive dat" [puppet] - 10https://gerrit.wikimedia.org/r/621343 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [13:35:53] (03PS6) 10JMeybohm: helmfile.d: refactor eventgate-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/619437 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [13:35:59] Hey! I'm about to do my first ever (security patch) deploy, anyone around to hold my hand?... [13:36:02] I'm scared :P [13:36:17] The instructions on https://wikitech.wikimedia.org/wiki/How_to_perform_security_fixes look clear enough, but still... [13:38:13] (03CR) 10jerkins-bot: [V: 04-1] helmfile.d: refactor eventgate-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/619437 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [13:38:59] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:39:29] @Reedy: you around? [13:39:35] I am [13:40:02] Ah great! I'm looking for someone to hold my hand during my first deploy ever. 
[13:40:09] Security patch for T260485 [13:40:22] Do you have 15 minutes or so? [13:41:05] I'm just finishing off an email to Safeguard, but yeah [13:41:06] !now [13:41:08] !next [13:41:10] jouncebot: now [13:41:11] No deployments scheduled for the next 3 hour(s) and 18 minute(s) [13:41:12] jouncebot: next [13:41:12] In 3 hour(s) and 18 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200824T1700) [13:41:30] doorbell [13:41:51] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Good job, I left a few comments, in some cases I just wasn't 100% sure of what approach you were taking." (0312 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/621721 (https://phabricator.wikimedia.org/T260663) (owner: 10JMeybohm) [13:43:31] (03PS1) 10Filippo Giunchedi: alertmanager: improve Karma sorting and coloring [puppet] - 10https://gerrit.wikimedia.org/r/622154 (https://phabricator.wikimedia.org/T258948) [13:43:36] re [13:43:40] (03PS1) 10Filippo Giunchedi: alertmanager: group Icinga alerts by name and service [puppet] - 10https://gerrit.wikimedia.org/r/622155 (https://phabricator.wikimedia.org/T258948) [13:43:57] Reedy: for the signoff, do i need a gpg key on the deploy host? [13:44:19] I don't think we consistently sign patches off [13:44:31] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:44:34] the instructions include that step, so i'll at least try [13:44:48] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: group Icinga alerts by name and service [puppet] - 10https://gerrit.wikimedia.org/r/622155 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [13:44:52] works without key it seems [13:44:54] Which instructions? 
[13:44:55] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: improve Karma sorting and coloring [puppet] - 10https://gerrit.wikimedia.org/r/622154 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [13:44:58] https://wikitech.wikimedia.org/wiki/How_to_perform_security_fixes [13:45:15] based on the branch used in the example, it's recent :) [13:45:49] I love the inconsistency with https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Security_patches [13:45:57] * Reedy files a bug about that [13:46:38] can i continue using the instructions i pointed to, or do i need to read the other set? [13:46:53] Yeah, you can continue [13:47:11] ok. i merged the patch in the staging dir. i'm now moving the patch file to the patch dir [13:47:25] (03PS1) 10Filippo Giunchedi: Use label 'severity' for service problem [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/622156 (https://phabricator.wikimedia.org/T258948) [13:48:19] ok, it's at /srv/patches/1.36.0-wmf.5/core/07-T260485.patch [13:48:24] do i need to worry about file permissions? [13:48:38] depends on your umask :D [13:48:56] 0002 [13:49:01] -rw-r--r-- [13:49:04] looks ok to me [13:49:26] yeah, group is wikidev too, so should be fine [13:50:24] I added a commit to the patch repo [13:51:24] the instructions say that "You may want to verify at this point that the bug is fixed on test.wikipedia.org", but that doesn't work if i don't deploy it, right?... [13:52:06] This fix is for the actor recorded when applying global user blocks. Not sure how to test that.. [13:52:46] btw, I'm assuming that 1.36.0-wmf.5 is the only active branch right now. [13:53:25] https://versions.toolforge.org/ tells you the versions [13:53:48] You can deploy to the test hosts using https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Step_2:_get_the_code_on_the_deployment_host [13:54:02] not that section... 
[13:54:14] https://wikitech.wikimedia.org/wiki/WikimediaDebug#Staging_changes [13:54:19] Why is that a separate page and.. [13:54:21] 10Operations, 10SRE-Access-Requests: Request for access to analytics-privatedata-users - https://phabricator.wikimedia.org/T260450 (10Nuria) @cparle: so we can better direct this request, what information are you looking for when it comes to mediasearch? [13:54:58] by the way, i'd love a way to test security patches on beta. [13:55:10] i guess i should file a bug for that [13:55:17] lol [13:55:30] Beta is basically open access... So it'd be pretty hard as is [13:55:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1092 for MCR change', diff saved to https://phabricator.wikimedia.org/P12327 and previous config saved to /var/cache/conftool/dbconfig/20200824-135538-marostegui.json [13:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:46] (03PS1) 10Marostegui: db1092: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/622157 [13:57:26] ok, did a scap pull on mwdebug1001, and using the debug browser extension to hit this host. [13:57:34] i'll make a few edits to see that nothing explodes. [13:57:56] (03CR) 10Marostegui: [C: 03+2] db1092: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/622157 (owner: 10Marostegui) [13:58:03] I'm not a steward, so i can't test cross-wiki blocks [13:59:22] !log Stop mysql on db1117:3325 to clone db1128 - T260324 [13:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:26] T260324: Upgrade m5 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T260324 [13:59:29] ^ this will trigger haproxy alerts [14:00:13] Reedy: ok, so nothing exploded [14:00:28] I just realized i'm supposed to be in a meeting. great. 
[14:00:38] haha [14:00:39] I'll join that and be really distracted while i scap :P [14:01:17] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is OK: (C)100 gt (W)80 gt 73.22 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37 [14:05:52] Reedy: I'm about to hit enter of this: scap sync-file --no-log-message php-1.36.0-wmf.5/includes/ActorMigration.php 'Deploy security fix for T260485' [14:06:03] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [14:06:04] sounds good? [14:06:07] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [14:06:13] ^ expected [14:06:25] Aye [14:06:32] ok, doing it [14:06:56] * duesen holds breath [14:07:10] (03PS1) 10Herron: aptrepo: add thirdparty/elastic79 component [puppet] - 10https://gerrit.wikimedia.org/r/622158 [14:08:30] ok, scap finished [14:08:36] !log Deployed patch for T260485 [14:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:56] Reedy: Do I assume correctly that it's now up to the security team to get this fix into master (and do backports?) [14:10:08] Not necessarily [14:10:17] would be nice to have this in master soon, so we can write better code that replaces the present hack. [14:10:17] Depends on the nature of the bug [14:10:30] Can it immediately go into master? Is disclosing this an issue? [14:10:35] Does it affect other branches or only master? [14:10:35] Ok, can you comment on the ticket with what you think the next steps should be? [14:10:56] Affect 1.35, I think [14:11:29] If it's only 1.35 or newer... 
We can probably get away with just putting it into master and backporting [14:11:39] Get it out in the 1.35 final release [14:11:53] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [14:11:57] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [14:12:55] Reedy: at what point is it ok to make the ticket public, and put the fix on gerrit? [14:13:15] So it depends on the above [14:13:31] If it's only an issue in master (for example)... you can basically do it immediately after deploying it (and it being tested properly) [14:13:52] I'm "pretty" sure it's not in 1.34 [14:14:00] If it was on stable release branches, it would (probably) have to wait till the next security release [14:14:22] Depending on severity, of course... If it's a minor/hardening type issue, it can be made public fairly easily [14:14:24] let me ask this differently: can i drop this off the platform engineering board? [14:14:26] 10Operations, 10SRE-Access-Requests: Request for access to analytics-privatedata-users - https://phabricator.wikimedia.org/T260450 (10Cparle) Well, @EBernhardson wrote this for me https://people.wikimedia.org/~ebernhardson/commonswiki_queries_across_wikis_20200801_20200807.html and I want to be able to write... [14:14:43] Dunno. I don't know your practices :) [14:14:57] The security team doesn't have to be the one to do the backports etc [14:15:29] I guess if we can clarify for definite it's not in REL1_34, cleaning it all up should be pretty simple [14:17:34] It can then just go into master and REL1_34 and be opened up etc. 
All good [14:18:00] !log disabling puppet on cumin1001 and starting a test of the DC switchover automation, expect some SAL noise but no production impact [14:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:16] FYI ^ and I'll be watching this channel just in case [14:22:11] !log installing libexif security updates on stretch [14:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:36] (03CR) 10Andrew Bogott: [C: 03+2] standard_packages: remove purge of 'at' package [puppet] - 10https://gerrit.wikimedia.org/r/621816 (owner: 10Andrew Bogott) [14:24:06] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-disable-puppet [14:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:09] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-disable-puppet (exit_code=0) [14:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:24] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl [14:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:35] !log rzl@cumin1001 END (FAIL) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=99) [14:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:01] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl [14:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:10] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0) [14:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:19] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-warmup-caches [14:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:50] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): WDQS Categories 
reload is failing on thankyouwiki - https://phabricator.wikimedia.org/T261097 (10dcausse) There are two dump types involved here: - the full dumps happening once a week and generating a RDF turtle file (p... [14:30:33] PROBLEM - High average GET latency for mw requests on appserver in codfw on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [14:31:55] RECOVERY - High average GET latency for mw requests on appserver in codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [14:31:57] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1001 job=burrow partition={2,3} prometheus=ops site=eqiad topic={udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=tha [14:31:57] ogging-eqiad&var-topic=All&var-consumer_group=All [14:32:02] (03PS1) 10Ppchelko: Remove service.key configs from flent-bit TLS. [deployment-charts] - 10https://gerrit.wikimedia.org/r/622159 [14:32:37] 10Operations, 10Traffic, 10Platform Team Initiatives (API Gateway), 10Platform Team Sprints Board (Sprint 1), and 2 others: Client Developer has a cookie-free API call - https://phabricator.wikimedia.org/T258748 (10eprodromou) Ha! Sorry about that, @Brandon! 
[14:33:40] (03PS2) 10Vgutierrez: ATS: Disable ECDHE-RSA-AES128-SHA support [puppet] - 10https://gerrit.wikimedia.org/r/622138 (https://phabricator.wikimedia.org/T258405) [14:34:04] (03CR) 10Jbond: [C: 03+2] admin: Update urbanecm's default home - add .vimrc [puppet] - 10https://gerrit.wikimedia.org/r/622129 (owner: 10Urbanecm) [14:34:13] (03CR) 10Jbond: [C: 03+2] "merging" [puppet] - 10https://gerrit.wikimedia.org/r/622129 (owner: 10Urbanecm) [14:35:29] 10Operations, 10Traffic: Don't set cookies for api.wikimedia.org at the caching layer - https://phabricator.wikimedia.org/T260943 (10eprodromou) [14:35:32] 10Operations, 10Traffic, 10Platform Team Initiatives (API Gateway), 10Platform Team Sprints Board (Sprint 1), and 2 others: Client Developer has a cookie-free API call - https://phabricator.wikimedia.org/T258748 (10eprodromou) [14:35:49] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 8553 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:36:07] thanks jbond42 :) [14:36:53] <_joe_> the codfw stuff is us testing the switchover [14:38:11] (03CR) 10Cwhite: [C: 03+1] aptrepo: add thirdparty/elastic79 component [puppet] - 10https://gerrit.wikimedia.org/r/622158 (owner: 10Herron) [14:38:18] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-warmup-caches (exit_code=0) [14:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:01] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance [14:39:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:03] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 223.7 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:39:20] 
!log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0) [14:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:13] (03CR) 10Ppchelko: api-gateway: strip cookie headers from requests and responses. (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/620311 (https://phabricator.wikimedia.org/T259296) (owner: 10Hnowlan) [14:41:16] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/621472 (https://phabricator.wikimedia.org/T259143) (owner: 10Filippo Giunchedi) [14:41:39] !log creating cirrus indices for lldwiki [14:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:55] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.02-set-readonly [14:41:56] !log rzl@cumin1001 [DRY-RUN] MediaWiki read-only period starts at: 2020-08-24 14:41:55.754938 [14:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:07] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.02-set-readonly (exit_code=0) [14:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:11] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly [14:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:28] (03CR) 10Filippo Giunchedi: [C: 03+1] aptrepo: add thirdparty/elastic79 component [puppet] - 10https://gerrit.wikimedia.org/r/622158 (owner: 10Herron) [14:42:37] (03CR) 10Cwhite: [V: 03+2 C: 03+2] Use label 'severity' for service problem [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/622156 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [14:42:45] !log rzl@cumin1001 END (FAIL) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=99) [14:42:47] (03CR) 10Herron: [C: 03+2] 
aptrepo: add thirdparty/elastic79 component [puppet] - 10https://gerrit.wikimedia.org/r/622158 (owner: 10Herron) [14:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:03] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki [14:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:11] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (exit_code=0) [14:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:19] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.05-invert-redis-sessions [14:43:21] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.05-invert-redis-sessions (exit_code=0) [14:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:28] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite [14:43:30] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0) [14:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:35] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite [14:43:35] !log rzl@cumin1001 [DRY-RUN] MediaWiki read-only period ends at: 2020-08-24 14:43:35.570234 [14:43:35] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0) [14:43:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:57] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [14:45:19] Reedy: thanks for your help. I have tagged the ticket for another round through triage :) [14:46:09] (03CR) 10Ayounsi: [C: 03+2] "Tested and works as expected." [homer/public] - 10https://gerrit.wikimedia.org/r/622127 (owner: 10Ayounsi) [14:46:36] (03Merged) 10jenkins-bot: Add "preferred" to primary when more than 2 IPs configured [homer/public] - 10https://gerrit.wikimedia.org/r/622127 (owner: 10Ayounsi) [14:46:38] uughh ganeti5002 is down looks like, I'll take a look [14:47:23] !log powercycle ganeti5002 -- host down and nothing in console [14:47:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:12] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-restore-ttl [14:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:23] !log rzl@cumin1001 END (FAIL) - Cookbook sre.switchdc.mediawiki.08-restore-ttl (exit_code=99) [14:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:46] duesen: Reedy: thanks for getting that live :). [14:48:56] (03CR) 10Hnowlan: [C: 03+2] Remove service.key configs from flent-bit TLS. [deployment-charts] - 10https://gerrit.wikimedia.org/r/622159 (owner: 10Ppchelko) [14:50:04] (03Merged) 10jenkins-bot: Remove service.key configs from flent-bit TLS. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/622159 (owner: 10Ppchelko) [14:51:01] RECOVERY - Host ganeti5002 is UP: PING OK - Packet loss = 0%, RTA = 228.50 ms [14:51:17] (03PS1) 10Herron: logstash: add #o11y tag to logstash alert descriptions [puppet] - 10https://gerrit.wikimedia.org/r/622161 [14:51:56] thx godog [14:52:03] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-restore-ttl [14:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:16] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-restore-ttl (exit_code=0) [14:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:19] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-start-maintenance [14:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:39] np herron, looks like the host was unhappy [14:52:48] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "We also need to actually pool the ro records in the dc_to, and depool them in dc_from" [cookbooks] - 10https://gerrit.wikimedia.org/r/621304 (owner: 10RLazarus) [14:52:50] I'll file a task, not sure if there's anything to do [14:53:05] kk [14:53:52] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0) [14:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:56] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-update-tendril [14:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:10] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-update-tendril (exit_code=0) [14:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:01] 10Operations, 10SRE-tools, 10tox-wikimedia, 10Patch-For-Review, 10User-Kormat: Introduce Python code formatters usage - https://phabricator.wikimedia.org/T211750 (10Kormat) [14:55:03] 10Operations, 10ops-eqsin: 
ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (10fgiunchedi) [14:55:29] RECOVERY - Host prometheus5001 is UP: PING OK - Packet loss = 0%, RTA = 228.72 ms [14:55:57] !log switchover test complete, puppet re-enabled on cumin1001 [14:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:30] (03PS4) 10Kormat: Add 'black' formatter support. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/621724 (https://phabricator.wikimedia.org/T211750) [14:56:57] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:57:07] (03PS4) 10Kormat: Run 'black' against setup.py and wmfmariadbpy [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/621753 [14:57:09] (03PS4) 10Kormat: Run 'black' in CI against wmfmariadbpy. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/621754 [14:57:45] (03CR) 10Ema: [C: 03+1] ATS: Disable ECDHE-RSA-AES128-SHA support [puppet] - 10https://gerrit.wikimedia.org/r/622138 (https://phabricator.wikimedia.org/T258405) (owner: 10Vgutierrez) [14:58:26] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . 
[14:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:09] PROBLEM - Prometheus prometheus5001/ops restarted: beware possible monitoring artifacts on prometheus5001 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqsin+prometheus/ops [15:02:13] (03CR) 10Vgutierrez: [C: 03+2] ATS: Disable ECDHE-RSA-AES128-SHA support [puppet] - 10https://gerrit.wikimedia.org/r/622138 (https://phabricator.wikimedia.org/T258405) (owner: 10Vgutierrez) [15:04:03] !log rolling restart of ats-tls to disable ECDHE-RSA-AES128-SHA - T258405 [15:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:07] T258405: Deprecate TLSv1.2 weak ciphersuites - https://phabricator.wikimedia.org/T258405 [15:07:10] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [15:08:40] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [15:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:09] (03CR) 10Kormat: [C: 03+2] Run 'black' in CI against wmfmariadbpy. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/621754 (owner: 10Kormat) [15:12:21] (03CR) 10Kormat: [C: 03+2] Run 'black' against setup.py and wmfmariadbpy [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/621753 (owner: 10Kormat) [15:12:24] (03CR) 10Kormat: [C: 03+2] Add 'black' formatter support. 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/621724 (https://phabricator.wikimedia.org/T211750) (owner: 10Kormat) [15:12:34] 10Operations, 10ops-codfw: backup2001 RAID controller failure, unable to post 2020-08-19 - https://phabricator.wikimedia.org/T260764 (10Papaul) Can you please depool and shutdown this server. I need to replace the raid controller [15:13:09] (03Merged) 10jenkins-bot: Add 'black' formatter support. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/621724 (https://phabricator.wikimedia.org/T211750) (owner: 10Kormat) [15:13:11] 10Operations, 10ops-codfw: mc2028 regular and mgmt interface down - https://phabricator.wikimedia.org/T260224 (10Papaul) @RLazarus any update? [15:13:14] (03Merged) 10jenkins-bot: Run 'black' against setup.py and wmfmariadbpy [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/621753 (owner: 10Kormat) [15:13:18] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [15:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:38] (03CR) 10Kormat: [V: 03+2 C: 03+2] Run 'black' in CI against wmfmariadbpy. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/621754 (owner: 10Kormat) [15:13:40] (03Merged) 10jenkins-bot: Run 'black' in CI against wmfmariadbpy. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/621754 (owner: 10Kormat) [15:16:56] (03PS1) 10Marostegui: install_server: Do not reimage db1128 [puppet] - 10https://gerrit.wikimedia.org/r/622164 (https://phabricator.wikimedia.org/T260324) [15:17:21] 10Operations, 10ops-codfw: backup2001 RAID controller failure, unable to post 2020-08-19 - https://phabricator.wikimedia.org/T260764 (10jcrespo) This can be considered "depooled" at all times until we close the ticket- no need to ask permission (unlike databases, for which we need to pool them back quite often... 
[15:17:48] (03CR) 10Bstorm: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/621343 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [15:17:51] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1128 [puppet] - 10https://gerrit.wikimedia.org/r/622164 (https://phabricator.wikimedia.org/T260324) (owner: 10Marostegui) [15:18:14] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) this is waiting for ceph so that we can move/rebuild things without user downtime. [15:19:35] (03PS1) 10Ayounsi: Initial CNI config [homer/public] - 10https://gerrit.wikimedia.org/r/622165 (https://phabricator.wikimedia.org/T259036) [15:20:42] (03CR) 10Ayounsi: "Tested and works as expected." [homer/public] - 10https://gerrit.wikimedia.org/r/622165 (https://phabricator.wikimedia.org/T259036) (owner: 10Ayounsi) [15:20:54] !log shutdown backup2001 T260764 [15:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:58] T260764: backup2001 RAID controller failure, unable to post 2020-08-19 - https://phabricator.wikimedia.org/T260764 [15:22:17] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-09-14) rack/setup/install dbprov2003.codfw.wmnet - https://phabricator.wikimedia.org/T258749 (10jcrespo) 05Open→03Resolved closing again, as it has not happened again. [15:23:42] (03CR) 10CDanis: Initial CNI config (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/622165 (https://phabricator.wikimedia.org/T259036) (owner: 10Ayounsi) [15:24:54] (03CR) 10Ppchelko: api-gateway: strip cookie headers from requests and responses. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/620311 (https://phabricator.wikimedia.org/T259296) (owner: 10Hnowlan) [15:26:37] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 556 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:27:17] 10Puppet, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team (CI & Testing services), 10Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)): scap on beta fails canary check: KeyError: 'aggregations' - https://phabricator.wikimedia.org/T260667 (10LarsWirzenius) a:05LarsWirzenius→03None... [15:28:55] RECOVERY - Prometheus prometheus5001/ops restarted: beware possible monitoring artifacts on prometheus5001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqsin+prometheus/ops [15:30:49] (03PS1) 10RobH: absent user due to stolen laptop [puppet] - 10https://gerrit.wikimedia.org/r/622168 [15:31:36] (03CR) 10jerkins-bot: [V: 04-1] absent user due to stolen laptop [puppet] - 10https://gerrit.wikimedia.org/r/622168 (owner: 10RobH) [15:32:32] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 49 probes of 556 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:33:14] (03CR) 10CDanis: [C: 03+1] Initial CNI config [homer/public] - 10https://gerrit.wikimedia.org/r/622165 (https://phabricator.wikimedia.org/T259036) (owner: 10Ayounsi) [15:34:57] (03CR) 10Ayounsi: [C: 03+2] Initial CNI config (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/622165 (https://phabricator.wikimedia.org/T259036) (owner: 
10Ayounsi) [15:35:27] (03Merged) 10jenkins-bot: Initial CNI config [homer/public] - 10https://gerrit.wikimedia.org/r/622165 (https://phabricator.wikimedia.org/T259036) (owner: 10Ayounsi) [15:39:27] 10Operations, 10Traffic: Deprecate TLSv1.2 weak ciphersuites - https://phabricator.wikimedia.org/T258405 (10Vgutierrez) [15:42:31] (03PS2) 10RobH: absent user due to stolen laptop [puppet] - 10https://gerrit.wikimedia.org/r/622168 [15:43:25] (03CR) 10jerkins-bot: [V: 04-1] absent user due to stolen laptop [puppet] - 10https://gerrit.wikimedia.org/r/622168 (owner: 10RobH) [15:45:07] (03PS3) 10RobH: absent user due to stolen laptop [puppet] - 10https://gerrit.wikimedia.org/r/622168 (https://phabricator.wikimedia.org/T261139) [15:48:36] (03PS4) 10RobH: remove kzeta key/groups due to stolen laptop [puppet] - 10https://gerrit.wikimedia.org/r/622168 (https://phabricator.wikimedia.org/T261139) [15:50:15] (03CR) 10RobH: [C: 03+2] remove kzeta key/groups due to stolen laptop [puppet] - 10https://gerrit.wikimedia.org/r/622168 (https://phabricator.wikimedia.org/T261139) (owner: 10RobH) [15:50:39] win 47 [15:50:44] win some, lose some [15:59:53] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) install memory upgrades in ores200[1-9] - https://phabricator.wikimedia.org/T259908 (10Papaul) [16:00:13] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 53 probes of 556 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:05:49] 10Operations, 10ops-codfw: backup2001 RAID controller failure, unable to post 2020-08-19 - https://phabricator.wikimedia.org/T260764 (10Papaul) 05Open→03Resolved 1- RAID controller drivers updated 2- Replaced RAiD controller 3- upgrade IDRAC [16:06:09] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 49 probes of 556 (alerts on 50) - 
https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:06:09] (03CR) 10Bstorm: wikireplicas: refactor to eliminate confusing "labsdb" naming (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/621618 (https://phabricator.wikimedia.org/T260843) (owner: 10Bstorm) [16:07:49] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% [16:07:49] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [16:09:33] 10Operations: Integrate Stretch 9.13 point update - https://phabricator.wikimedia.org/T258407 (10MoritzMuehlenhoff) [16:09:54] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Papaul) [16:10:16] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Papaul) [16:13:47] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.39 ms [16:13:47] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms [16:14:18] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By:TBD) rack/setup/install rows C and D new PDUs - https://phabricator.wikimedia.org/T253694 (10RobH) taken from sre meeting notes: > Will send out an email regarding PDU upgrades (parent task T253694) during the DC failover at eqiad in rows C and D, with proposed... 
[16:18:20] (03CR) 10Ebernhardson: Multiple instances of msearch_daemon (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) (owner: 10ZPapierski) [16:35:49] (03CR) 10Dzahn: "I think it would be helpful if there was a ticket for the access request part of this specifically, which is also tagged as access-request" [puppet] - 10https://gerrit.wikimedia.org/r/621343 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [16:36:49] (03CR) 10Bstorm: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/621343 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [16:46:16] (03PS7) 10Bstorm: wikireplicas: refactor to eliminate confusing "labsdb" naming [puppet] - 10https://gerrit.wikimedia.org/r/621618 (https://phabricator.wikimedia.org/T260843) [16:54:02] 10Operations, 10Data-Services, 10SRE-Access-Requests, 10cloud-services-team (Kanban): Enable access for wmcs-admins to run wmcs-prefixed cookbooks on cumin hosts - https://phabricator.wikimedia.org/T261145 (10Bstorm) [16:55:13] (03PS4) 10Bstorm: cumin: for new wmcs. prefix for cookbooks, grant access to wmcs-admins [puppet] - 10https://gerrit.wikimedia.org/r/621343 (https://phabricator.wikimedia.org/T260389) [16:58:53] (03CR) 10Bstorm: "https://phabricator.wikimedia.org/T261145" [puppet] - 10https://gerrit.wikimedia.org/r/621343 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [16:59:18] 10Operations, 10Data-Services, 10SRE-Access-Requests, 10Patch-For-Review, 10cloud-services-team (Kanban): Enable access for wmcs-admins to run wmcs-prefixed cookbooks on cumin hosts - https://phabricator.wikimedia.org/T261145 (10Bstorm) a:05Bstorm→03None [17:00:04] gehel and onimisionipe: That opportune time is upon us again. Time for a Wikidata Query Service weekly deploy deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200824T1700). 
[17:00:21] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 51 probes of 554 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:01:34] (03CR) 10Hnowlan: api-gateway: strip cookie headers from requests and responses. (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/620311 (https://phabricator.wikimedia.org/T259296) (owner: 10Hnowlan) [17:05:37] (03CR) 10Jcrespo: wikireplicas: refactor to eliminate confusing "labsdb" naming (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/621618 (https://phabricator.wikimedia.org/T260843) (owner: 10Bstorm) [17:06:17] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 49 probes of 554 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:07:21] (03CR) 10Ppchelko: [C: 03+1] api-gateway: strip cookie headers from requests and responses. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/620311 (https://phabricator.wikimedia.org/T259296) (owner: 10Hnowlan) [17:07:28] (03CR) 10Bstorm: wikireplicas: refactor to eliminate confusing "labsdb" naming (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/621618 (https://phabricator.wikimedia.org/T260843) (owner: 10Bstorm) [17:08:40] (03CR) 10Jcrespo: "As much as we would like to deploy this, this is currently not possible. 
See blockers at: https://phabricator.wikimedia.org/T224589#617601" [puppet] - 10https://gerrit.wikimedia.org/r/621370 (owner: 10Dzahn) [17:12:26] 10Operations, 10ops-codfw: backup2001 RAID controller failure, unable to post 2020-08-19 - https://phabricator.wikimedia.org/T260764 (10jcrespo) @Papaul, for the record, you got sent a new RAID controller from vendor (I wasn't aware of that if true)? [17:16:14] (03CR) 10Dzahn: "oh, i wasn't aware of that. thank you, yea, no big deal then, i will just abandon. i was doing a bunch of jessie-removal changes, not just" [puppet] - 10https://gerrit.wikimedia.org/r/621370 (owner: 10Dzahn) [17:16:50] (03Abandoned) 10Dzahn: tendril: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/621370 (owner: 10Dzahn) [17:18:13] (03PS2) 10Dzahn: prometheus: replace hiera() with lookup() in several exporters [puppet] - 10https://gerrit.wikimedia.org/r/621771 [17:26:13] 10Operations, 10Scap, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Scap configuration for WDQS should get server groups from a known source or truth - https://phabricator.wikimedia.org/T252124 (10CBogen) [17:28:26] (03PS8) 10Bstorm: wikireplicas: refactor to eliminate confusing "labsdb" naming [puppet] - 10https://gerrit.wikimedia.org/r/621618 (https://phabricator.wikimedia.org/T260843) [17:31:30] (03CR) 10Bstorm: wikireplicas: refactor to eliminate confusing "labsdb" naming (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/621618 (https://phabricator.wikimedia.org/T260843) (owner: 10Bstorm) [17:33:19] (03CR) 10Jcrespo: [C: 03+1] "This looks ok to me- honestly with such a large change, it is likely we will find some minor issues with wrong paths or templates, but mys" [puppet] - 10https://gerrit.wikimedia.org/r/621618 (https://phabricator.wikimedia.org/T260843) (owner: 10Bstorm) [17:35:27] PROBLEM - Host logstash2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:37:36] (03CR) 10Bstorm: "PCC for latest patch set: 
https://puppet-compiler.wmflabs.org/compiler1002/24617/" [puppet] - 10https://gerrit.wikimedia.org/r/621618 (https://phabricator.wikimedia.org/T260843) (owner: 10Bstorm) [17:41:32] (03PS1) 10RLazarus: site: move mc2037 to role mediawiki::memcached [puppet] - 10https://gerrit.wikimedia.org/r/622178 [17:41:47] (03PS3) 10Dzahn: prometheus: replace hiera() with lookup(), add data types for node lookups [puppet] - 10https://gerrit.wikimedia.org/r/621759 [17:43:28] (03PS2) 10RLazarus: site: move mc2037 to role mediawiki::memcached [puppet] - 10https://gerrit.wikimedia.org/r/622178 (https://phabricator.wikimedia.org/T260224) [17:43:30] (03CR) 10jerkins-bot: [V: 04-1] prometheus: replace hiera() with lookup(), add data types for node lookups [puppet] - 10https://gerrit.wikimedia.org/r/621759 (owner: 10Dzahn) [17:44:02] (03PS1) 10Effie Mouzeli: hiera: change opcached settings for testing [puppet] - 10https://gerrit.wikimedia.org/r/622179 (https://phabricator.wikimedia.org/T261009) [17:44:13] RECOVERY - Host logstash2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.99 ms [17:44:56] (03PS4) 10Dzahn: prometheus: hiera() -> lookup(), add data type for prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/621759 [17:45:59] (03CR) 10jerkins-bot: [V: 04-1] prometheus: hiera() -> lookup(), add data type for prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/621759 (owner: 10Dzahn) [17:46:45] (03CR) 10DCausse: [C: 03+1] Increase weight of grants and research in metawiki search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621767 (https://phabricator.wikimedia.org/T260569) (owner: 10Ebernhardson) [17:48:29] (03CR) 10Dzahn: "sorry reviewers, added you a bit early. 
i am not even sure yet why jenkins gives V-1" [puppet] - 10https://gerrit.wikimedia.org/r/621759 (owner: 10Dzahn) [17:55:27] (03PS5) 10Dzahn: prometheus: hiera() -> lookup(), add data type for prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/621759 [17:59:30] (03PS2) 10Dzahn: openstack: replace hiera() with lookup() in several places [puppet] - 10https://gerrit.wikimedia.org/r/621779 [18:00:04] RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200824T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:00:44] (03PS3) 10Dzahn: openstack: replace hiera() with lookup() in several places [puppet] - 10https://gerrit.wikimedia.org/r/621779 [18:02:37] (03PS2) 10Ebernhardson: Increase weight of grants and research in metawiki search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621767 (https://phabricator.wikimedia.org/T260569) [18:02:51] (03CR) 10Dzahn: "ok, found and fixed the issue. 
ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/621759 (owner: 10Dzahn) [18:02:59] (03CR) 10Ebernhardson: [C: 03+2] "deploying in backport window" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621767 (https://phabricator.wikimedia.org/T260569) (owner: 10Ebernhardson) [18:03:52] (03Merged) 10jenkins-bot: Increase weight of grants and research in metawiki search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621767 (https://phabricator.wikimedia.org/T260569) (owner: 10Ebernhardson) [18:03:59] 10Operations, 10serviceops: Replace mc2028 with mc2037 in production - https://phabricator.wikimedia.org/T261154 (10jijiki) [18:05:41] (03PS2) 10Dzahn: zuul: add data types, replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/621758 [18:05:59] (03PS1) 10Effie Mouzeli: mcrouter/redis: replace mc2028 with mc2037 [puppet] - 10https://gerrit.wikimedia.org/r/622180 (https://phabricator.wikimedia.org/T261154) [18:08:27] (03PS2) 10Effie Mouzeli: hiera: change mwdebug1001 opcached settings for testing [puppet] - 10https://gerrit.wikimedia.org/r/622179 (https://phabricator.wikimedia.org/T261009) [18:10:09] !log ebernhardson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: cirrus: Increase weight of grants and research namespaces in metawiki search (duration: 00m 58s) [18:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:13] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/24624/contint1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/621758 (owner: 10Dzahn) [18:11:41] (03CR) 10Dzahn: "well, i would have to compile this on basically * to be really sure. i could do that but would also block the compiler for a while. 
should" [puppet] - 10https://gerrit.wikimedia.org/r/621759 (owner: 10Dzahn) [18:43:21] Amir1: https://phabricator.wikimedia.org/T261133#6407297 [18:44:45] RECOVERY - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is OK: OK: all nodes running as expected https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [18:45:23] RhinosF1: thanks! [18:45:59] If they really want to implement it, they need to get an approval from the board [18:46:00] Amir1: np, I messaged Urbanecm as well. Consensus ain't much when no sane deployer is going to deploy it [18:46:09] Yeah thatd change things [18:46:23] But go completely against the mission so I doubt it'd fly [18:47:17] link for anyone's reference: https://meta.wikimedia.org/wiki/Requesting_wiki_configuration_changes, "Ultimate authority lies with the system administrators, and site configuration requests may be declined for any reason" [18:48:06] yeah for example, we are not going to deploy shorturl or dynamic page lists extension to any new wiki no matter what the community consensus is because they will cause outages [18:48:34] If they want that sort of change, they need to speak to.... someone at the foundation [18:48:40] T&S/Legal type? [18:48:50] likely the board c/o legal [18:48:59] ^ [18:49:08] with shorturl, they could write a relatively simple gadget that uses w.wiki and generates it for every page someone visits [18:49:18] Definitely feels a slippery slope [18:49:23] "Well, X wiki did it... Why can't we?" [18:49:41] Urbanecm: there's a website for that [18:49:48] You can bookmark a link [18:49:49] link? 
[18:49:59] Urbanecm: cdanis shared it the other day [18:50:50] the general approach of "bookmarking a link that is a javascript: URL" is called 'bookmarklet' -- and yeah I did find one that someone had made for specifically w.wiki [18:51:11] Google for [w.wiki bookmarklet] had it at #2 :) [18:51:24] Urbanecm: generating a short for every page view someone visits sounds horrible :D [18:51:24] https://edg2s.github.io/w.wiki-bookmarklet/ [18:51:29] please don't do that [18:51:37] Amir1: no, just whenever you click the bookmark [18:51:40] ah, bookmarklet - i thought we talk about a website :D [18:51:52] cdanis: amir was reacting to this >= with shorturl, they could write a relatively simple gadget that uses w.wiki and generates it for every page someone visits [18:51:58] cdanis: I know, Urbanecm suggested something else [18:52:00] ah [18:52:02] nvm [18:52:03] something sinister [18:52:16] Amir1: I'm not going to do that or suggest that, I'm just warning someone might try that [18:52:43] if someone tries that, we would be in trouble :D [18:52:47] I hope no one does [18:56:09] Someone once spawned that many processes due to a broke script that was updating discord server info to a page twice a second [18:56:17] That was a fun chat [18:58:19] well, let's not talk about the fact that a bot alone was responsible for 17% of all logins across all of our infra [18:58:43] was? that's fixed? cool! [18:58:53] yeah, it's fixed [18:59:01] https://phabricator.wikimedia.org/T256533 [18:59:01] cool [18:59:17] do/should we throttle logins? [18:59:52] bots bypass ratelimit AFAIK [18:59:53] Amir1: I've seen a bot that became the top requestor by a mile on Miraheze cleaning sandboxes on a wiki with 2/3 edits a day [19:00:12] lol [19:00:45] Amir1: I had a fun answer for that. Fix it or kill it. [19:00:59] Amir1: we can do `&can-bypass=false` [19:01:13] yup, I'm planning the block one of the bots actually [19:01:30] wdym? you're planing to block all the bots? 
[19:02:05] no the bot that logs in 35K times a day and hasn't fixed it since I pinged the operator a couple of times [19:02:39] ah, i see [19:02:55] ListeriaBot? or which one? [19:02:56] Urbanecm: hmm, yeah, it's slightly problematic to use &can-bypass=false, it has caused several unknown issues once we enabled it for edits in wikidata [19:03:08] i can lock that for 72 hours or so, should be enough for the owner to notice the notice [19:03:12] Mr.Ibrahembot [19:03:36] I did notify the operator several times https://phabricator.wikimedia.org/T256533#6383310 [19:05:09] where can i find the bucket btw? [19:05:14] (if it's auto-updating) [19:06:01] Urbanecm: https://logstash.wikimedia.org/goto/5870ce6c973f60ca68c665c13cb5c81b [19:06:11] it seems it's still logging in 40K times a day... [19:06:28] The bot should be blocked [19:10:30] (change visibility) 21:10, 24 August 2020 Martin Urbanec talk contribs block changed status for global account "User:Mr.Ibrahembot@global": set locked; unset (none) (Forcing bot shut-down: please fix https://ar.wikipedia.org/wiki/%D9%85%D9%88%D8%B6%D9%88%D8%B9:Vp2mudl4jr44aduw / T256533 in your bot ) [19:10:32] T256533: Identify accounts with very high login rate - https://phabricator.wikimedia.org/T256533 [19:10:38] agreed, done [19:10:50] 10Operations, 10serviceops: assess and re-evaluate 'weight' settings of appservers in codfw - https://phabricator.wikimedia.org/T261159 (10Dzahn) [19:12:03] Urbanecm: Thanks! [19:12:28] no problem. I've informed the user via https://w.wiki/aGF [19:15:12] (03CR) 10Southparkfan: "Emanuele or Moritz, can one of you review the change again? 
I've changed the directory to /run for every server now :)" [puppet] - 10https://gerrit.wikimedia.org/r/621967 (https://phabricator.wikimedia.org/T252990) (owner: 10Southparkfan) [19:16:07] edit conflict lol [19:17:23] not with flow : [19:17:30] :D [19:17:30] or whatever it is called those days [19:22:10] (03CR) 10Subramanya Sastry: [C: 03+1] "Let us please go with this .. otherwise, we have to perennially keep updating and syncing the VE clone of this code which will look differ" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612879 (https://phabricator.wikimedia.org/T241961) (owner: 10C. Scott Ananian) [19:23:58] (03PS1) 10CDanis: package_builder: add support for 'sloppy' backports [puppet] - 10https://gerrit.wikimedia.org/r/622190 [19:24:13] (03CR) 10Effie Mouzeli: "PCC for this patch + 622178" [puppet] - 10https://gerrit.wikimedia.org/r/622180 (https://phabricator.wikimedia.org/T261154) (owner: 10Effie Mouzeli) [19:25:06] !log disabling puppet on 'R:File = /etc/nutcracker/nutcracker.yaml' to swap mc2028 out for mc2037 T261154 [19:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:11] T261154: Replace mc2028 with mc2037 in production - https://phabricator.wikimedia.org/T261154 [19:25:17] 10Operations, 10serviceops: assess and re-evaluate 'weight' settings of appservers in codfw - https://phabricator.wikimedia.org/T261159 (10Dzahn) p:05Triage→03High [19:25:26] * that should have said nutcracker.yml, editing [19:26:28] (03PS2) 10CDanis: package_builder: add support for 'sloppy' backports [puppet] - 10https://gerrit.wikimedia.org/r/622190 [19:26:30] (03CR) 10Effie Mouzeli: "As we found out, this patch per se leads to puppet errors on the host, as the host is not included in the redis nutcracker shards. 
This PP" [puppet] - 10https://gerrit.wikimedia.org/r/622178 (https://phabricator.wikimedia.org/T260224) (owner: 10RLazarus) [19:28:24] (03PS3) 10RLazarus: site: move mc2037 to role mediawiki::memcached [puppet] - 10https://gerrit.wikimedia.org/r/622178 (https://phabricator.wikimedia.org/T261154) [19:29:07] (03CR) 10CDanis: "PCC lgtm: https://puppet-compiler.wmflabs.org/compiler1003/24630/deneb.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/622190 (owner: 10CDanis) [19:29:29] (03CR) 10RLazarus: [C: 03+2] site: move mc2037 to role mediawiki::memcached [puppet] - 10https://gerrit.wikimedia.org/r/622178 (https://phabricator.wikimedia.org/T261154) (owner: 10RLazarus) [19:29:56] (03CR) 10RLazarus: [C: 03+2] mcrouter/redis: replace mc2028 with mc2037 [puppet] - 10https://gerrit.wikimedia.org/r/622180 (https://phabricator.wikimedia.org/T261154) (owner: 10Effie Mouzeli) [19:31:26] (03PS4) 10BryanDavis: wmcs: collect prometheus metrics from alertmanager in metricsinfra [puppet] - 10https://gerrit.wikimedia.org/r/620760 [19:32:27] (03PS1) 10Ahmon Dancy: WIP: Add support for 'dev' realm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622192 [19:33:28] (03CR) 10jerkins-bot: [V: 04-1] WIP: Add support for 'dev' realm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622192 (owner: 10Ahmon Dancy) [19:34:06] (03PS1) 10Ahmon Dancy: WIP: Add support for dev realm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622193 [19:34:56] (03CR) 10jerkins-bot: [V: 04-1] WIP: Add support for dev realm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622193 (owner: 10Ahmon Dancy) [19:39:13] (03Abandoned) 10Ahmon Dancy: WIP: Add support for 'dev' realm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622192 (owner: 10Ahmon Dancy) [19:45:33] 10Operations, 10LDAP-Access-Requests: Product Analytics/Superset Access: - https://phabricator.wikimedia.org/T261160 (10Rileych) [19:46:26] 10Operations, 10LDAP-Access-Requests: Product Analytics/Superset Access: 
LDAP access to the wmf group for Chelsea Riley - https://phabricator.wikimedia.org/T261160 (10Rileych) [20:00:04] halfak and accraze: #bothumor I � Unicode. All rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200824T2000). [20:00:55] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Performance-Team (Radar): decom tungsten - https://phabricator.wikimedia.org/T260395 (10wiki_willy) a:03Jclark-ctr [20:01:15] RECOVERY - Aggregate IPsec Tunnel Status eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [20:06:29] (03PS8) 10C. Scott Ananian: Alternate configuration mechanism for Parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612879 (https://phabricator.wikimedia.org/T241961) [20:29:01] !log re-enabled puppet on 'R:File = /etc/nutcracker/nutcracker.yml' T261154 [20:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:06] T261154: Replace mc2028 with mc2037 in production - https://phabricator.wikimedia.org/T261154 [20:33:13] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:34:43] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:42:39] PROBLEM - Memcached on mc2037 is CRITICAL: connect to address 10.192.32.40 and port 11211: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [20:44:05] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 62 probes of 556 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:47:51] (03PS1) 10BryanDavis: dynamicproxy: Fix missing proxy handling [puppet] - 10https://gerrit.wikimedia.org/r/622195 (https://phabricator.wikimedia.org/T258730) [20:49:59] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 556 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:53:21] 10Operations, 10serviceops: assess and re-evaluate 'weight' settings of appservers in codfw - https://phabricator.wikimedia.org/T261159 (10Dzahn) Step 1 was to gather the data. Here is a table with " host - weight - hwtype", done in wiki syntax because then we get a **sortable** table which I could semi-auto... [20:53:45] (03CR) 10BryanDavis: "PCC output: https://puppet-compiler.wmflabs.org/compiler1003/24631/" [puppet] - 10https://gerrit.wikimedia.org/r/622195 (https://phabricator.wikimedia.org/T258730) (owner: 10BryanDavis) [20:58:02] 10Operations, 10Traffic, 10Platform Team Initiatives (API Gateway), 10Platform Team Sprints Board (Sprint 1), and 2 others: Client Developer has a cookie-free API call - https://phabricator.wikimedia.org/T258748 (10Nuria) @eprodromou In order to know how/if removal of cookies will affect metrics we would n... 
[20:59:53] RECOVERY - Memcached on mc2037 is OK: TCP OK - 0.033 second response time on 10.192.32.40 port 11211 https://wikitech.wikimedia.org/wiki/Memcached [21:00:04] Reedy and sbassett: Time to snap out of that daydream and deploy Weekly Security deployment window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200824T2100). [21:10:24] 10Operations, 10serviceops: Memcached is listening to 127.0.0.1 after first puppet runs - https://phabricator.wikimedia.org/T261164 (10jijiki) [21:13:06] (03PS3) 10Effie Mouzeli: hiera: change mwdebug1001 opcache settings for testing [puppet] - 10https://gerrit.wikimedia.org/r/622179 (https://phabricator.wikimedia.org/T261009) [21:17:08] (03PS5) 10Effie Mouzeli: helmfile: add values for staging environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/621605 (https://phabricator.wikimedia.org/T256973) [21:21:00] Going to deploy a quick change to an existing security mitigation in /private. [21:21:41] 10Operations, 10observability, 10Patch-For-Review, 10good first task: nagios-nrpe-server.service: systemd unit references path below legacy directory /var/run/ - https://phabricator.wikimedia.org/T252990 (10Southparkfan) I have uploaded a new patch using /run on all servers (regardless of OS). However, wha... [21:21:44] sbassett: Thank you :-) [21:22:30] (03CR) 10Southparkfan: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/621967 (https://phabricator.wikimedia.org/T252990) (owner: 10Southparkfan) [21:25:35] (03CR) 10Effie Mouzeli: [C: 03+2] hiera: change mwdebug1001 opcache settings for testing [puppet] - 10https://gerrit.wikimedia.org/r/622179 (https://phabricator.wikimedia.org/T261009) (owner: 10Effie Mouzeli) [21:26:01] 10Operations, 10serviceops: assess and re-evaluate 'weight' settings of appservers in codfw - https://phabricator.wikimedia.org/T261159 (10Dzahn) Now let's look closer and the hardware details of the types above. 
We will ignore the ones scheduled for decom before the switch and that 1 special case for now. `... [21:29:31] !log sbassett@deploy1001 Synchronized private/PrivateSettings.php: Deployed additional mitigations for T257687 (duration: 00m 58s) [21:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:14] 10Operations, 10serviceops: assess and re-evaluate 'weight' settings of appservers in codfw - https://phabricator.wikimedia.org/T261159 (10Dzahn) Based on the info above I would suggest to: - have the same weight for all servers in the 2016 class, not partially 10 and partially 20 - have a higher weight the n... [21:36:32] 10Operations, 10serviceops: assess and re-evaluate 'weight' settings of appservers in codfw - https://phabricator.wikimedia.org/T261159 (10Dzahn) changes this would need: mw2187 - mw2199: lower weight from 10 to 0 (decom) mw2224 - mw2242 - lower weight from 20 to 10 (to match with 2254 - 2258) mw2268 - mw227... [21:47:32] 10Operations, 10ops-codfw, 10Patch-For-Review: mc2028 regular and mgmt interface down - https://phabricator.wikimedia.org/T260224 (10RLazarus) [21:47:59] 10Operations, 10decommission-hardware, 10serviceops: decommission mc2028.codfw.wmnet - https://phabricator.wikimedia.org/T261168 (10RLazarus) [21:49:26] 10Operations, 10ops-codfw, 10Patch-For-Review: mc2028 regular and mgmt interface down - https://phabricator.wikimedia.org/T260224 (10RLazarus) @Papaul Sorry for the delay, we only just got mc2037 up and serving. Opened T261168 for the decom, starting the service owner steps now. [21:56:46] 10Operations, 10ops-codfw, 10Patch-For-Review: mc2028 regular and mgmt interface down - https://phabricator.wikimedia.org/T260224 (10Papaul) 05Open→03Resolved @RLazarus thanks for the update. we can resolve this task now since the decom task is already open. Thanks. 
[21:59:52] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Papaul) The Delivery ETA for this is 08/31/20 so it is not possible to have those servers by 2020-08-31. [22:00:28] I cannot access any WMF website with my VPN turned on. It worked flawlessly until like 10 minutes ago. Is there something I can do? [22:01:07] <_joe_> aehm [22:04:14] (Aside from disabling that, I mean. As long as that's expected) [22:06:44] just tried with enabling my VPN and i can connect just fine [22:06:52] sounds like that would be specific to your provider [22:07:22] (03CR) 10Bstorm: [C: 03+2] "Let's try it! I can always roll back if it gets ugly." [puppet] - 10https://gerrit.wikimedia.org/r/622195 (https://phabricator.wikimedia.org/T258730) (owner: 10BryanDavis) [22:09:20] Yeah I guessed that. I can share the IP with you folks if you want to check. [22:10:28] Daimona_: ideally, please follow https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue [22:10:32] !log rzl@cumin1001 START - Cookbook sre.hosts.decommission [22:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:07] 10Operations, 10ops-eqiad, 10DC-Ops: decommission samarium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T197630 (10wiki_willy) a:05ayounsi→03Jclark-ctr [22:12:28] Thank you, I'll do that tomorrow. 
It's pretty late here and it might as well fix itself in the meantime :-) [22:13:17] !log rzl@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [22:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:21] 10Operations, 10decommission-hardware, 10serviceops: decommission mc2028.codfw.wmnet - https://phabricator.wikimedia.org/T261168 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by rzl@cumin1001 for hosts: `mc2028.codfw.wmnet` - mc2028.codfw.wmnet (**FAIL**) - Failed downtime host on Icinga... [22:13:55] 10Operations, 10ops-eqiad, 10DC-Ops: decommission samarium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T197630 (10wiki_willy) 05Resolved→03Open @ayounsi - I'm reopening this task, since the server hasn't been unracked yet. [22:16:18] (03PS1) 10CDanis: depool esams for router upgrades [dns] - 10https://gerrit.wikimedia.org/r/622226 [22:16:58] 10Operations, 10serviceops: Memcached is listening to 127.0.0.1 after first puppet runs - https://phabricator.wikimedia.org/T261164 (10jijiki) p:05Triage→03Low [22:17:59] 10Operations, 10serviceops: Replace mc2028 with mc2037 in production - https://phabricator.wikimedia.org/T261154 (10jijiki) p:05Triage→03Medium [22:18:12] 10Operations, 10serviceops: Replace mc2028 with mc2037 in production - https://phabricator.wikimedia.org/T261154 (10jijiki) 05Open→03Resolved a:03jijiki Server is in production, closing [22:19:14] 10Operations, 10LDAP-Access-Requests: Product Analytics/Superset Access: LDAP access to the wmf group for Chelsea Riley - https://phabricator.wikimedia.org/T261160 (10jijiki) p:05Triage→03Medium [22:19:16] (03PS1) 10RLazarus: site, install_server: Remove mc2028 for decom [puppet] - 10https://gerrit.wikimedia.org/r/622227 (https://phabricator.wikimedia.org/T261168) [22:19:32] 10Operations, 10LDAP-Access-Requests: Product Analytics/Superset Access: LDAP access to the wmf group for Chelsea Riley - 
https://phabricator.wikimedia.org/T261160 (10jijiki) a:03jijiki [22:20:35] 10Operations, 10ops-eqsin: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (10jijiki) p:05Triage→03High [22:20:45] 10Operations, 10Data-Services, 10SRE-Access-Requests, 10Patch-For-Review, 10cloud-services-team (Kanban): Enable access for wmcs-admins to run wmcs-prefixed cookbooks on cumin hosts - https://phabricator.wikimedia.org/T261145 (10jijiki) p:05Triage→03Medium a:03jijiki [22:20:47] 10Operations, 10ops-eqsin, 10serviceops: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (10jijiki) [22:21:34] 10Operations, 10Wikimedia-Mailing-lists: Disable google code in mailinglists - https://phabricator.wikimedia.org/T261084 (10jijiki) p:05Triage→03Medium a:03jijiki [22:23:34] 10Operations, 10Wikimedia-Mailing-lists: Several unreadable mailing list descriptions due to wrong charset encodings, should be Unicode - https://phabricator.wikimedia.org/T261031 (10jijiki) p:05Triage→03Medium [22:23:39] 10Operations, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decommission mc2028.codfw.wmnet - https://phabricator.wikimedia.org/T261168 (10RLazarus) [22:24:23] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:25:33] 10Operations, 10Wikimedia-Mailing-lists: Disable google code in mailinglists - https://phabricator.wikimedia.org/T261084 (10Dzahn) The difference is that "disable" can be reverted and does not delete the config of the list, including the list of subscribers. "remove" is permanent and actually removes the list... 
[22:27:04] 10Operations, 10Wikimedia-Mailing-lists: Disable google code in mailinglists - https://phabricator.wikimedia.org/T261084 (10Dzahn) So to know which one you want you basically just have to answer the question if you want to re-enable it later and still have the same list of people subscribed to it and their add... [22:28:21] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:29:40] (03CR) 10Dzahn: [C: 04-1] "i think typo in [0-79]" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/622227 (https://phabricator.wikimedia.org/T261168) (owner: 10RLazarus) [22:30:25] that's valid, it matches any digit but 8 :) will change since it's not obvious though [22:30:40] mutante: ^ oops, meant to be in -serviceops [22:31:29] it will match 0-7 and 9? ok then [22:32:04] (03PS2) 10RLazarus: site, install_server: Remove mc2028 for decom [puppet] - 10https://gerrit.wikimedia.org/r/622227 (https://phabricator.wikimedia.org/T261168) [22:32:56] (03CR) 10Dzahn: [C: 03+1] site, install_server: Remove mc2028 for decom [puppet] - 10https://gerrit.wikimedia.org/r/622227 (https://phabricator.wikimedia.org/T261168) (owner: 10RLazarus) [22:33:12] not even sure if we have to care about always have exact matching regex for these cases, tbh [22:33:21] looks good [22:33:23] (03CR) 10RLazarus: site, install_server: Remove mc2028 for decom (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/622227 (https://phabricator.wikimedia.org/T261168) (owner: 10RLazarus) [22:33:29] thanks! 
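Editor's note on the `[0-79]` exchange above: the character class reads as the range 0-7 plus the literal 9, so it matches every single digit except 8. A minimal Python sketch to check that reading follows. The hostname pattern `mc202[0-79]` is a hypothetical illustration only and is not taken from the actual puppet change.

```python
import re

# Character class from the review discussion: the range 0-7 plus the literal 9.
DIGIT_CLASS = re.compile(r"[0-79]")

# Hypothetical hostname pattern, for illustration only (not the real site.pp regex).
HYPOTHETICAL_HOST = re.compile(r"mc202[0-79]")


def matching_digits(pattern):
    """Return the single digits that the pattern matches in full."""
    return [d for d in "0123456789" if pattern.fullmatch(d)]


if __name__ == "__main__":
    # Lists every digit accepted by the class; 8 is the only one missing.
    print(matching_digits(DIGIT_CLASS))
    # The hypothetical pattern accepts mc2027 and mc2029 but skips mc2028.
    for host in ("mc2027", "mc2028", "mc2029"):
        print(host, bool(HYPOTHETICAL_HOST.fullmatch(host)))
```

Writing the class as `[0-79]` is valid but easy to misread as a typo for `[0-9]`, which is why the patch was reworded after review.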
[22:33:58] (03CR) 10RLazarus: [C: 03+2] site, install_server: Remove mc2028 for decom [puppet] - 10https://gerrit.wikimedia.org/r/622227 (https://phabricator.wikimedia.org/T261168) (owner: 10RLazarus) [22:40:54] (03PS1) 10RLazarus: Remove production DNS for mc2028, in decom [dns] - 10https://gerrit.wikimedia.org/r/622230 (https://phabricator.wikimedia.org/T261168) [22:42:04] (03CR) 10Dzahn: [C: 03+1] Remove production DNS for mc2028, in decom [dns] - 10https://gerrit.wikimedia.org/r/622230 (https://phabricator.wikimedia.org/T261168) (owner: 10RLazarus) [22:43:28] (03CR) 10RLazarus: [C: 03+2] Remove production DNS for mc2028, in decom [dns] - 10https://gerrit.wikimedia.org/r/622230 (https://phabricator.wikimedia.org/T261168) (owner: 10RLazarus) [22:46:41] 10Operations, 10ops-eqsin, 10serviceops: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (10wiki_willy) a:03RobH [22:47:50] 10Operations, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decommission mc2028.codfw.wmnet - https://phabricator.wikimedia.org/T261168 (10RLazarus) [22:47:51] 10Operations, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decommission mc2028.codfw.wmnet - https://phabricator.wikimedia.org/T261168 (10RLazarus) a:05RLazarus→03Papaul [22:48:01] 10Operations, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decommission mc2028.codfw.wmnet - https://phabricator.wikimedia.org/T261168 (10RLazarus) @Papaul Over to you, thanks! [23:00:04] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200824T2300) [23:00:04] subbu: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:19] \o/ [23:00:22] here. 
[23:00:27] 10Operations, 10ops-eqiad, 10DC-Ops: decommission samarium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T197630 (10Jclark-ctr) removed from rack updated netbox Resolving ticket previous comments list ports already removed [23:00:29] 10Operations, 10ops-eqiad, 10DC-Ops: decommission samarium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T197630 (10Jclark-ctr) 05Open→03Resolved [23:00:43] subbu: do you want to self-service, or should i deploy? [23:01:01] can you please? :) [23:01:06] sure! [23:01:38] 10Operations, 10ops-eqiad, 10DC-Ops: Check samarium status in Netbox - https://phabricator.wikimedia.org/T260772 (10Jclark-ctr) 05Open→03Resolved https://phabricator.wikimedia.org/T197630. corrected netbox removed host from rack [23:02:03] (03CR) 10Urbanecm: [C: 03+2] Alternate configuration mechanism for Parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612879 (https://phabricator.wikimedia.org/T241961) (owner: 10C. Scott Ananian) [23:02:32] (03Merged) 10jenkins-bot: Alternate configuration mechanism for Parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612879 (https://phabricator.wikimedia.org/T241961) (owner: 10C. Scott Ananian) [23:06:29] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: 778f710bbbdb24730f7ce4c75d5ff1ca7a5ce3b3: Alternate configuration mechanism for Parsoid (T241961) (duration: 00m 58s) [23:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:33] T241961: VisualEditor was removed from Wikitech because Parsoid/PHP isn't yet compatible with how Wikitech is set up - https://phabricator.wikimedia.org/T241961 [23:06:43] subbu: it should be live :) [23:06:50] ty. let me verify. [23:07:02] thanks, just was going to ask you for that :) [23:08:10] lgtm. [23:08:14] all done :) thanks. [23:08:37] great! 
happy to help :) [23:16:33] !log Evening B&C window done [23:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:53] This page took 53 seconds to parse on cache miss [23:29:54] https://www.mediawiki.org/wiki/Manual:Coding_conventions/PHP [23:29:58] I wonder what that's all about [23:30:54] 10Operations, 10Documentation: Improve documentation for mirrors.wikimedia.org - https://phabricator.wikimedia.org/T179856 (10Dzahn) There is https://wikitech.wikimedia.org/wiki/Mirrors [23:32:25] 19 seconds to parse https://www.mediawiki.org/wiki/Manual:Coding_conventions/JavaScript [23:32:28] Krinkle: maybe the use of syntax highlighting? I seem to recall something similar with a page on enwiki that also used syntax highlighting heavily, and that made it slow to load [23:32:35] 4 seconds to parse https://www.mediawiki.org/wiki/Manual:PHP_unit_testing/Writing_unit_tests [23:35:07] DannyS712: are you aware of a task or discussion on-wiki somewhere about it? [23:36:18] I was thinking of https://phabricator.wikimedia.org/T233990 [23:37:09] 08Warning Alert for device cr2-esams.wikimedia.org - Memory over 85% [23:38:05] DannyS712: I see, syntax highlight seems to parse fairly quickly in isolation from what I've tried incl on enwiki [23:38:09] Perhaps it's [23:39:32] meh, in isolation that parses fine as well, incl when I copy that whole page into a sandbox and preview it with and without translate. [23:42:32] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366 (10Marc) * Could something like [[https://www.mediawiki.org/wiki/Manual:PurgeList.php|mw:Manual:PurgeList.php]] offer a possible solution here? Maybe com... 
[23:44:22] (03PS1) 10BryanDavis: dynamicproxy: serve default /robots.txt and /favicon.ico for Toolforge [puppet] - 10https://gerrit.wikimedia.org/r/622237 (https://phabricator.wikimedia.org/T251628) [23:44:24] (03PS1) 10BryanDavis: dynamicproxy: allow service workers in Toolforge [puppet] - 10https://gerrit.wikimedia.org/r/622238 (https://phabricator.wikimedia.org/T158216) [23:44:34] (03CR) 10CDanis: [C: 03+2] depool esams for router upgrades [dns] - 10https://gerrit.wikimedia.org/r/622226 (owner: 10CDanis) [23:44:39] (03PS2) 10CDanis: depool esams for router upgrades [dns] - 10https://gerrit.wikimedia.org/r/622226 [23:46:52] !log depool esams T259621 [23:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:08] 10Operations, 10ops-eqsin, 10serviceops: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (10RobH) So we likely need to run a CPU test via the Dell testing suite, and that will require downtime of the node. AFAICT the directions for this are on: htt... 
[23:52:41] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366 (10Krinkle) MediaWiki req [23:54:46] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 57.05 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [23:56:20] (03PS2) 10BryanDavis: dynamicproxy: serve default /robots.txt and /favicon.ico for Toolforge [puppet] - 10https://gerrit.wikimedia.org/r/622237 (https://phabricator.wikimedia.org/T251628) [23:56:34] (03PS2) 10BryanDavis: dynamicproxy: allow service workers in Toolforge [puppet] - 10https://gerrit.wikimedia.org/r/622238 (https://phabricator.wikimedia.org/T158216) [23:57:10] 08̶W̶a̶r̶n̶i̶n̶g Device cr2-esams.wikimedia.org recovered from Memory over 85%