[00:00:04] twentyafterfour: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190425T0000).
[00:00:31] (CR) Dzahn: "a possible long-term fix would be to use ensure_packages() or require_package() in both of the places" [puppet] - https://gerrit.wikimedia.org/r/506329 (owner: Bstorm)
[00:00:52] Yeaaaah, but core dumps directory is not accessible
[00:01:07] ugh
[00:01:26] also does not reproduce if I run any script that runs php instead of running php directly
[00:01:47] what the actual f
[00:02:48] MaxSem: what's the core dump directory
[00:03:10] $ cat /proc/sys/kernel/core_pattern
[00:03:10] /var/tmp/core/core.%h.%e.%p.%t
[00:03:39] smalyshev wikidev 317M Apr 24 23:59 core.mwmaint1002.php.123986.1556150346
[00:03:46] do you want that file?
[00:03:57] where is it?
[00:04:19] /var/tmp/core/core.mwmaint1002.php.123986.1556150346
[00:04:21] on mwmaint1002
[00:04:28] mutante: if you could put it in my home dir I could try to see if there's any sense in it
[00:04:32] (CR) Bstorm: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/506329 (owner: Bstorm)
[00:04:43] ah wait I have access there
[00:04:44] good
[00:04:46] let me see
[00:04:52] yea, i was about to say it looks owned by you
[00:04:56] ok, cool
[00:05:31] Hmm, so if you know the precise file name you can read it...
[00:06:02] core.mwmaint1002.php.121093.1556150012 core.mwmaint1002.php.120194.1556149832
[00:06:11] the 2 before that
[00:06:15] (CR) Bstorm: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/506329 (owner: Bstorm)
[00:06:22] (gdb) p argv[2]
[00:06:23] $5 = 0x7fb971ca31a0 "/srv/mediawiki-staging/php-1.34.0-wmf.1/extensions/CirrusSearch/tests/phan/stubs/intl.php"
[00:06:32] ok this is something already
[00:07:28] now to figure out what makes hhvm barf on it
[00:07:46] can't get a much simpler php file
[00:07:49] (CR) Bstorm: "https://puppet-compiler.wmflabs.org/compiler1002/16008/ the toolforge example." [puppet] - https://gerrit.wikimedia.org/r/506319 (https://phabricator.wikimedia.org/T209527) (owner: Bstorm)
[00:07:54] yup confirmed php -l /srv/mediawiki-staging/php-1.34.0-wmf.1/extensions/CirrusSearch/tests/phan/stubs/intl.php drops dead
[00:08:33] yeah this is pretty dumb file
[00:08:34] https://3v4l.org/Dq206
[00:08:40] replicable on there
[00:08:41] Process exited with code 134.
[00:08:41] lol
[00:09:08] ohh so if you try to define class which already exists in hhvm it silently drops dead
[00:09:10] nice
[00:09:31] especially nice in lint mode of course
[00:09:42] nice
[00:09:51] MaxSem: Just sync the files you need rather than the whole dir? :P
[00:10:16] MaxSem: the only file I actually need is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/506209/1/includes/Maintenance/AnalysisConfigBuilder.php
[00:10:21] That's kinda a lot of them
[00:10:26] all the rest is tests
[00:10:59] the patch is one file of code change and tons of test fixtures
[00:12:33] !log maxsem@deploy1001 Synchronized php-1.34.0-wmf.1/extensions/CirrusSearch/includes/Maintenance/AnalysisConfigBuilder.php: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/CirrusSearch/+/506209/ (duration: 00m 54s)
[00:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:46] SMalyshev: ^
[00:12:57] MaxSem: thanks!
[00:13:13] so I guess I should file a bug to phab about this?
[00:13:25] or directly to hhvm? or both? not sure
[00:13:48] kill hhvm
[00:14:12] that's an option too but so far it runs as deployment gateway I see
[00:14:20] HHVM wouldn't care about crap they don't support
[00:15:02] well not dropping dead silently on simple class declaration may be something a proper language would want to support :)
[00:15:16] then again, they probably only care for running Facebook code
[00:15:40] They kinda don't support that pasky PHP anymore ;)
[00:15:48] *pesky
[00:16:30] yeah I know but I imagine that didn't change in their language - I still could make a class named "Transliterator"
[00:17:40] Anyway, we're done here, thanks everyone. I'll file a ticket
[00:21:15] MaxSem: I already did https://phabricator.wikimedia.org/T221814 - please feel free to add stuff there
[00:24:46] (PS1) Dzahn: labstore::fileserver::exports: convert to systemd service [puppet] - https://gerrit.wikimedia.org/r/506331 (https://phabricator.wikimedia.org/T194724)
[00:29:49] (CR) Dzahn: [V: +1] "https://puppet-compiler.wmflabs.org/compiler1002/16009/" [puppet] - https://gerrit.wikimedia.org/r/506331 (https://phabricator.wikimedia.org/T194724) (owner: Dzahn)
[00:40:44] (CR) Dzahn: [C: +2] mariadb: replace phab1002 grant comments with phab1003 [puppet] - https://gerrit.wikimedia.org/r/504964 (https://phabricator.wikimedia.org/T221389) (owner: Dzahn)
[00:40:51] (PS2) Dzahn: mariadb: replace phab1002 grant comments with phab1003 [puppet] - https://gerrit.wikimedia.org/r/504964 (https://phabricator.wikimedia.org/T221389)
[00:45:14] (Abandoned) Dzahn: mariadb: replace phab1002 grant comments with phab1003 [puppet] - https://gerrit.wikimedia.org/r/504964 (https://phabricator.wikimedia.org/T221389) (owner: Dzahn)
[01:41:31] PROBLEM - Apache HTTP on mw1288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[01:41:35] PROBLEM - Nginx local proxy to apache on mw1288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[01:42:35] PROBLEM - HHVM rendering on mw1288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[02:42:30] Operations, Release Pipeline, Release-Engineering-Team, Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (mobrovac) >>! In T220402#5136129, @Tarrow wrote: > We are indeed using service-runner; I don't think this provides /_info or /?...
[03:24:57] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:05:07] PROBLEM - puppet last run on an-worker1078 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:13:20] Operations, Traffic, Wikidata, Wikidata-Query-Service, and 3 others: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (Smalyshev) a:Smalyshev
[04:24:21] Operations, decommission: Reclaim/Decommission labcontrol1001, 1002 - https://phabricator.wikimedia.org/T221817 (Andrew)
[04:24:24] Operations, decommission: Reclaim/Decommission labnet1001, 1002 - https://phabricator.wikimedia.org/T221818 (Andrew)
[04:31:37] RECOVERY - puppet last run on an-worker1078 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[04:41:05] (PS1) Andrew Bogott: Remove a bunch of old Horizon code [puppet] - https://gerrit.wikimedia.org/r/506344
[04:41:07] (PS1) Andrew Bogott: Move labcontrol1001/1002 to role::spare and clean up references [puppet] - https://gerrit.wikimedia.org/r/506345 (https://phabricator.wikimedia.org/T221817)
[04:41:09] (PS1) Andrew Bogott: Move labnet1001, 1002 to role::spare, clean up other references [puppet] - https://gerrit.wikimedia.org/r/506346 (https://phabricator.wikimedia.org/T221818)
[04:42:49] (CR) Andrew Bogott: "https://puppet-compiler.wmflabs.org/compiler1002/16010/" [puppet] - https://gerrit.wikimedia.org/r/506344 (owner: Andrew Bogott)
[05:06:13] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 99 probes of 409 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[05:06:33] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 123 probes of 409 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[05:11:01] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[05:11:08] Operations, ops-codfw, DBA: Degraded RAID on db2047 - https://phabricator.wikimedia.org/T221481 (Marostegui) Open→Resolved The failed disk is now ok: ` root@db2047:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337E0DB0) Port Name: 1I Po...
[05:20:39] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 16 probes of 409 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[05:22:28] (PS1) Marostegui: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/506347 (https://phabricator.wikimedia.org/T221782)
[05:24:04] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1103:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/506347 (https://phabricator.wikimedia.org/T221782) (owner: Marostegui)
[05:25:07] (Merged) jenkins-bot: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/506347 (https://phabricator.wikimedia.org/T221782) (owner: Marostegui)
[05:25:09] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 408, down: 0, shutdown: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:27:19] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1103:3314 T221782 (duration: 00m 54s)
[05:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:27:25] T221782: Fix revision special indexes and partitions on db1103:3314 and db1113:3316 - https://phabricator.wikimedia.org/T221782
[05:27:25] (CR) jenkins-bot: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/506347 (https://phabricator.wikimedia.org/T221782) (owner: Marostegui)
[05:28:27] !log Deploy schema change on db1103:3314 to fix revision table partitioning and indexing - T221782
[05:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:39:39] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:40:21] (PS2) Marostegui: mariadb: Promote db2079 to s8 codfw master [puppet] - https://gerrit.wikimedia.org/r/505697 (https://phabricator.wikimedia.org/T220170)
[05:47:52] !log Start changing topology to make db2079 s8 codfw master - T220170
[05:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:58] T220170: Address Database infrastructure blockers on datacenter switchover & multi-dc deployment - https://phabricator.wikimedia.org/T220170
[05:50:33] (CR) Marostegui: [C: +2] mariadb: Promote db2079 to s8 codfw master [puppet] - https://gerrit.wikimedia.org/r/505697 (https://phabricator.wikimedia.org/T220170) (owner: Marostegui)
[05:53:57] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 16 probes of 409 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[05:58:59] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:59:21] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:59:25] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs1010 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[05:59:27] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1010 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[05:59:37] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 15 probes of 409 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[05:59:37] PROBLEM - WDQS HTTP Port on wdqs1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time
[06:00:13] PROBLEM - Check systemd state on ms-be2032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:00:17] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational
[06:00:34] (PS2) Marostegui: db-codfw.php: Promote db2079 to s8 codfw master. [mediawiki-config] - https://gerrit.wikimedia.org/r/505699 (https://phabricator.wikimedia.org/T220170)
[06:00:45] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs1010 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:00:47] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1010 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:00:57] RECOVERY - WDQS HTTP Port on wdqs1010 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.031 second response time
[06:02:15] (CR) Marostegui: [C: +2] db-codfw.php: Promote db2079 to s8 codfw master. [mediawiki-config] - https://gerrit.wikimedia.org/r/505699 (https://phabricator.wikimedia.org/T220170) (owner: Marostegui)
[06:03:15] (Merged) jenkins-bot: db-codfw.php: Promote db2079 to s8 codfw master. [mediawiki-config] - https://gerrit.wikimedia.org/r/505699 (https://phabricator.wikimedia.org/T220170) (owner: Marostegui)
[06:04:40] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Promote db2079 to s8 codfw master T220170 (duration: 00m 52s)
[06:04:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:46] T220170: Address Database infrastructure blockers on datacenter switchover & multi-dc deployment - https://phabricator.wikimedia.org/T220170
[06:07:37] (PS1) Marostegui: db-codfw.php: Repool db2080 [mediawiki-config] - https://gerrit.wikimedia.org/r/506349 (https://phabricator.wikimedia.org/T216240)
[06:09:27] RECOVERY - Check systemd state on ms-be2032 is OK: OK - running: The system is fully operational
[06:11:38] (CR) Marostegui: [C: +2] db-codfw.php: Repool db2080 [mediawiki-config] - https://gerrit.wikimedia.org/r/506349 (https://phabricator.wikimedia.org/T216240) (owner: Marostegui)
[06:12:28] (CR) jenkins-bot: db-codfw.php: Promote db2079 to s8 codfw master. [mediawiki-config] - https://gerrit.wikimedia.org/r/505699 (https://phabricator.wikimedia.org/T220170) (owner: Marostegui)
[06:12:37] (Merged) jenkins-bot: db-codfw.php: Repool db2080 [mediawiki-config] - https://gerrit.wikimedia.org/r/506349 (https://phabricator.wikimedia.org/T216240) (owner: Marostegui)
[06:12:50] (CR) jenkins-bot: db-codfw.php: Repool db2080 [mediawiki-config] - https://gerrit.wikimedia.org/r/506349 (https://phabricator.wikimedia.org/T216240) (owner: Marostegui)
[06:14:05] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2080 after onsite maintenance to upgrade BIOS and firmware - T216240 (duration: 00m 54s)
[06:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:11] T216240: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240
[06:31:05] (PS1) Marostegui: db-codfw.php: Reorganize s8 [mediawiki-config] - https://gerrit.wikimedia.org/r/506350 (https://phabricator.wikimedia.org/T220170)
[06:36:11] (CR) Marostegui: [C: +2] db-codfw.php: Reorganize s8 [mediawiki-config] - https://gerrit.wikimedia.org/r/506350 (https://phabricator.wikimedia.org/T220170) (owner: Marostegui)
[06:37:15] (Merged) jenkins-bot: db-codfw.php: Reorganize s8 [mediawiki-config] - https://gerrit.wikimedia.org/r/506350 (https://phabricator.wikimedia.org/T220170) (owner: Marostegui)
[06:38:05] (PS1) Marostegui: mariadb: Make db2080 candidate master for s8 codfw [puppet] - https://gerrit.wikimedia.org/r/506351 (https://phabricator.wikimedia.org/T220170)
[06:38:43] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Reorganize s8 codfw - T220170 (duration: 00m 54s)
[06:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:58] T220170: Address Database infrastructure blockers on datacenter switchover & multi-dc deployment - https://phabricator.wikimedia.org/T220170
[06:39:19] (CR) Marostegui: [C: +2] mariadb: Make db2080 candidate master for s8 codfw [puppet] - https://gerrit.wikimedia.org/r/506351 (https://phabricator.wikimedia.org/T220170) (owner: Marostegui)
[06:45:32] (CR) jenkins-bot: db-codfw.php: Reorganize s8 [mediawiki-config] - https://gerrit.wikimedia.org/r/506350 (https://phabricator.wikimedia.org/T220170) (owner: Marostegui)
[07:01:49] !log Run compare.py for main tables between db2045 and db2080 T220170
[07:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:54] T220170: Address Database infrastructure blockers on datacenter switchover & multi-dc deployment - https://phabricator.wikimedia.org/T220170
[07:03:08] (CR) Marostegui: [C: +1] profile::analytics::database::meta: add properties to my.cnf [puppet] - https://gerrit.wikimedia.org/r/506179 (https://phabricator.wikimedia.org/T212243) (owner: Elukey)
[07:08:02] (CR) Marostegui: "Probably worth a puppet compiler run to make sure everything works as expected? specially after removing the backups_and_dbstore_multiinst" [puppet] - https://gerrit.wikimedia.org/r/506180 (https://phabricator.wikimedia.org/T219399) (owner: Jcrespo)
[07:10:10] (CR) Marostegui: [C: +1] mariadb-snapshots: Stop replication during transfer [puppet] - https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[07:35:54] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/506176 (owner: Ema)
[07:42:14] (PS1) Muehlenhoff: Update aliases [puppet] - https://gerrit.wikimedia.org/r/506355 (https://phabricator.wikimedia.org/T221125)
[07:56:00] !log installing gnutls security updates
[07:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:33] !log update statistics grants for dbprov1* on tendril
[08:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:25] Operations, Prod-Kubernetes, serviceops, Kubernetes, and 2 others: improve docker registry architecture - https://phabricator.wikimedia.org/T209271 (fsero)
[08:23:31] Operations, Prod-Kubernetes, serviceops, Kubernetes, and 2 others: create and enable alerting on docker_registry_ha - https://phabricator.wikimedia.org/T221759 (fsero) Open→Resolved probably worth to include more alerts like one for the swift replication between dcs, but closing it for no...
[08:30:35] !log installing php5 security updates
[08:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:19] Operations, Operations-Software-Development, User-Joe, User-jijiki: Spicerack cookbooks TODO list - https://phabricator.wikimedia.org/T203943 (Mathew.onipe)
[08:43:23] Operations, Discovery-Search: Create cookbook to reindex into elasticsearch / cirrus - https://phabricator.wikimedia.org/T219507 (Mathew.onipe)
[08:44:21] (CR) Effie Mouzeli: [C: +1] Remove unused "multi" thumbor handler [puppet] - https://gerrit.wikimedia.org/r/505837 (https://phabricator.wikimedia.org/T221562) (owner: Gilles)
[08:45:10] Operations, Traffic, Wikidata, serviceops, and 4 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (Ladsgroup) >>! In T99531#5136277, @Dzahn wrote: >> Also note since recently we now have wikibase.org (https://gerrit.wikimedia.org/r/c/operati...
[08:45:19] (PS1) Muehlenhoff: Remove misleading comment [puppet] - https://gerrit.wikimedia.org/r/506361
[08:47:41] Operations, Discovery-Search, Wikidata, Wikidata-Query-Service: Create Cookbook to restart WDQS - https://phabricator.wikimedia.org/T221832 (Mathew.onipe)
[08:47:48] Operations, Operations-Software-Development, User-Joe, User-jijiki: Spicerack cookbooks TODO list - https://phabricator.wikimedia.org/T203943 (Mathew.onipe)
[08:47:54] Operations, Discovery-Search, Wikidata, Wikidata-Query-Service: Create Cookbook to restart WDQS - https://phabricator.wikimedia.org/T221832 (Mathew.onipe)
[08:49:26] (CR) Muehlenhoff: [C: +2] Remove misleading comment [puppet] - https://gerrit.wikimedia.org/r/506361 (owner: Muehlenhoff)
[08:50:31] (PS1) Marostegui: wmnet: Update pcX-master CNAMEs [dns] - https://gerrit.wikimedia.org/r/506362
[08:51:32] Operations, ops-eqiad, DBA, decommission: Decommission parsercache hosts: pc1004.eqiad.wmnet pc1005.eqiad.wmnet pc1006.eqiad.wmnet - https://phabricator.wikimedia.org/T210969 (Marostegui) Resolved→Open @RobH @Cmjohnson there are still DNS entries for all these hosts: ` templates/wmnet:pc1...
[08:54:59] (CR) Jcrespo: [C: +1] wmnet: Update pcX-master CNAMEs [dns] - https://gerrit.wikimedia.org/r/506362 (owner: Marostegui)
[08:55:08] (CR) Arturo Borrero Gonzalez: Update aliases (1 comment) [puppet] - https://gerrit.wikimedia.org/r/506355 (https://phabricator.wikimedia.org/T221125) (owner: Muehlenhoff)
[08:55:13] (CR) Marostegui: [C: +2] wmnet: Update pcX-master CNAMEs [dns] - https://gerrit.wikimedia.org/r/506362 (owner: Marostegui)
[08:57:11] (CR) Arturo Borrero Gonzalez: Move labnet1001, 1002 to role::spare, clean up other references (2 comments) [puppet] - https://gerrit.wikimedia.org/r/506346 (https://phabricator.wikimedia.org/T221818) (owner: Andrew Bogott)
[08:58:17] (CR) Arturo Borrero Gonzalez: Move labcontrol1001/1002 to role::spare and clean up references (2 comments) [puppet] - https://gerrit.wikimedia.org/r/506345 (https://phabricator.wikimedia.org/T221817) (owner: Andrew Bogott)
[08:59:39] (PS19) Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217)
[08:59:41] (PS5) Vgutierrez: trafficserver: Provide support for inbound TLS traffic [puppet] - https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594)
[09:00:45] (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/506319 (https://phabricator.wikimedia.org/T209527) (owner: Bstorm)
[09:01:10] (PS2) Muehlenhoff: Update aliases [puppet] - https://gerrit.wikimedia.org/r/506355 (https://phabricator.wikimedia.org/T221125)
[09:02:22] (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/506327 (https://phabricator.wikimedia.org/T221806) (owner: Bstorm)
[09:03:20] (PS20) Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217)
[09:03:25] (PS6) Vgutierrez: trafficserver: Provide support for inbound TLS traffic [puppet] - https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594)
[09:04:15] (CR) Muehlenhoff: Move labcontrol1001/1002 to role::spare and clean up references (2 comments) [puppet] - https://gerrit.wikimedia.org/r/506345 (https://phabricator.wikimedia.org/T221817) (owner: Andrew Bogott)
[09:09:28] (PS1) Effie Mouzeli: cxserver: Clean up old scb cluster stanzas [puppet] - https://gerrit.wikimedia.org/r/506366 (https://phabricator.wikimedia.org/T213195)
[09:11:43] (PS2) Effie Mouzeli: cxserver: Clean up old scb cluster stanzas [puppet] - https://gerrit.wikimedia.org/r/506366 (https://phabricator.wikimedia.org/T213195)
[09:15:38] (PS1) Fsero: registryha: feat: introducing LVS configuration [puppet] - https://gerrit.wikimedia.org/r/506367 (https://phabricator.wikimedia.org/T221101)
[09:16:42] (PS3) Effie Mouzeli: cxserver: Clean up old scb cluster stanzas [puppet] - https://gerrit.wikimedia.org/r/506366 (https://phabricator.wikimedia.org/T213195)
[09:22:07] (CR) Ema: [C: -1] "A few comments! The VTC test (fixed by hand addressing the comments added to the CR) fails with the following error:" (6 comments) [puppet] - https://gerrit.wikimedia.org/r/495643 (https://phabricator.wikimedia.org/T216339) (owner: Gilles)
[09:25:33] (PS2) Ema: cumin: add ATS production hosts to aliases [puppet] - https://gerrit.wikimedia.org/r/506177 (https://phabricator.wikimedia.org/T219967)
[09:27:01] (CR) Ema: [C: +2] cumin: add ATS production hosts to aliases [puppet] - https://gerrit.wikimedia.org/r/506177 (https://phabricator.wikimedia.org/T219967) (owner: Ema)
[09:27:51] (PS1) Fsero: registryha: feat: new records for lvs [dns] - https://gerrit.wikimedia.org/r/506369 (https://phabricator.wikimedia.org/T221101)
[09:29:49] (CR) Vgutierrez: [C: -1] registryha: feat: introducing LVS configuration (4 comments) [puppet] - https://gerrit.wikimedia.org/r/506367 (https://phabricator.wikimedia.org/T221101) (owner: Fsero)
[09:32:09] (PS4) Santhosh: Redirect Google Translate any wiki source to mobile [puppet] - https://gerrit.wikimedia.org/r/506043 (https://phabricator.wikimedia.org/T219819)
[09:32:11] (CR) Santhosh: Redirect Google Translate any wiki source to mobile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/506043 (https://phabricator.wikimedia.org/T219819) (owner: Santhosh)
[09:40:19] (PS3) Ema: debdeploy: make filter_services default to empty hash [puppet] - https://gerrit.wikimedia.org/r/506176
[09:41:26] (CR) Ema: [C: +2] debdeploy: make filter_services default to empty hash [puppet] - https://gerrit.wikimedia.org/r/506176 (owner: Ema)
[09:41:41] (PS4) Ema: cache: distinguish between Varnish and ATS nodes [puppet] - https://gerrit.wikimedia.org/r/505815 (https://phabricator.wikimedia.org/T219967)
[09:42:41] (CR) Ema: [C: +2] cache: distinguish between Varnish and ATS nodes [puppet] - https://gerrit.wikimedia.org/r/505815 (https://phabricator.wikimedia.org/T219967) (owner: Ema)
[09:46:07] Operations: Integrate Stretch 9.6 point update - https://phabricator.wikimedia.org/T209260 (MoritzMuehlenhoff) These updates have been fully deployed: ` apache2 gnutls28 `
[09:49:28] !log installing libcgroup security updates
[09:49:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:53] (PS1) Muehlenhoff: Add library hint for libcgroup [puppet] - https://gerrit.wikimedia.org/r/506372
[09:52:49] (CR) Muehlenhoff: [C: +2] Add library hint for libcgroup [puppet] - https://gerrit.wikimedia.org/r/506372 (owner: Muehlenhoff)
[09:57:03] !log installing multipath-tools update from stretch point release
[09:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:37] Operations: Integrate Stretch 9.6 point update - https://phabricator.wikimedia.org/T209260 (MoritzMuehlenhoff) These updates have been fully deployed: ` libcgroup multipath-tools `
[10:03:10] (CR) Alexandros Kosiaris: [C: +1] cxserver: Clean up old scb cluster stanzas [puppet] - https://gerrit.wikimedia.org/r/506366 (https://phabricator.wikimedia.org/T213195) (owner: Effie Mouzeli)
[10:03:22] (Abandoned) Alexandros Kosiaris: cxserver: Clean up old scb cluster stanzas [puppet] - https://gerrit.wikimedia.org/r/496383 (https://phabricator.wikimedia.org/T213195) (owner: Alexandros Kosiaris)
[10:09:42] (PS1) Urbanecm: Change wikimaniawiki logo to Wikimania 2019 version [mediawiki-config] - https://gerrit.wikimedia.org/r/506374 (https://phabricator.wikimedia.org/T221829)
[10:15:30] (PS2) Fsero: registryha: feat: introducing LVS configuration [puppet] - https://gerrit.wikimedia.org/r/506367 (https://phabricator.wikimedia.org/T221101)
[10:15:57] (PS21) Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217)
[10:15:59] (PS7) Vgutierrez: trafficserver: Provide support for inbound TLS traffic [puppet] - https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594)
[10:17:12] (CR) jerkins-bot: [V: -1] trafficserver: Provide support for inbound TLS traffic [puppet] - https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[10:17:14] (CR) jerkins-bot: [V: -1] trafficserver: Provide support for multiple ATS instances [puppet] - https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217) (owner: Vgutierrez)
[10:20:51] (PS3) Fsero: registryha: feat: introducing LVS configuration [puppet] - https://gerrit.wikimedia.org/r/506367 (https://phabricator.wikimedia.org/T221101)
[10:35:48] (PS22) Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217)
[10:35:50] (PS8) Vgutierrez: trafficserver: Provide support for inbound TLS traffic [puppet] - https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594)
[10:41:17] (PS23) Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217)
[10:41:19] (PS9) Vgutierrez: trafficserver: Provide support for inbound TLS traffic [puppet] - https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594)
[10:44:25] (PS24) Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217)
[10:44:28] (PS10) Vgutierrez: trafficserver: Provide support for inbound TLS traffic [puppet] - https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594)
[10:45:42] jouncebot, next
[10:45:43] In 0 hour(s) and 14 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190425T1100)
[10:46:42] (CR) jerkins-bot: [V: -1] trafficserver: Provide support for multiple ATS instances [puppet] - https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217) (owner: Vgutierrez)
[10:51:37] (PS25) Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217)
[10:51:39] (PS11) Vgutierrez: trafficserver: Provide support for inbound TLS traffic [puppet] - https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594)
[10:52:33] (PS1) Jbond: nflog: add logging prefix to firewall log entries [puppet] - https://gerrit.wikimedia.org/r/506377 (https://phabricator.wikimedia.org/T220987)
[10:52:35] (CR) jerkins-bot: [V: -1] trafficserver: Provide support for multiple ATS instances [puppet] - https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217) (owner: Vgutierrez)
[10:53:40] (CR) Alexandros Kosiaris: [C: -1] "The reverse RRs are missing. Should be in templates/10.in-addr.arpa" [dns] - https://gerrit.wikimedia.org/r/506369 (https://phabricator.wikimedia.org/T221101) (owner: Fsero)
[10:53:57] (CR) Jbond: [C: +2] nflog: add logging prefix to firewall log entries [puppet] - https://gerrit.wikimedia.org/r/506377 (https://phabricator.wikimedia.org/T220987) (owner: Jbond)
[10:54:05] (PS2) Jbond: nflog: add logging prefix to firewall log entries [puppet] - https://gerrit.wikimedia.org/r/506377 (https://phabricator.wikimedia.org/T220987)
[10:55:09] Lucas_WMDE_, looks I'll be out of energy soon. No charger nearby. Don't be surprised if I won't write anything during swat time.
[10:55:23] asdasdadasdasdadasdadasd~.~.
[10:55:39] (03PS1) 10Mathew.onipe: elasticsearch: config file for aligning puppet config [puppet] - 10https://gerrit.wikimedia.org/r/506378 (https://phabricator.wikimedia.org/T218932) [10:55:45] ^^cat [10:55:52] XDDD [10:57:04] LOL [10:57:07] (03PS4) 10Alexandros Kosiaris: registryha: feat: introducing LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/506367 (https://phabricator.wikimedia.org/T221101) (owner: 10Fsero) [10:57:33] (03CR) 10Faidon Liambotis: [C: 04-1] coherence report: General improvements and rack checks (031 comment) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) (owner: 10CRusnov) [10:59:04] (03PS1) 10Arturo Borrero Gonzalez: openstack: neutron: l3_agent: increase conntrack table size [puppet] - 10https://gerrit.wikimedia.org/r/506379 [10:59:14] (03PS26) 10Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - 10https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217) [10:59:16] (03PS12) 10Vgutierrez: trafficserver: Provide support for inbound TLS traffic [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) [10:59:37] (03PS2) 10Arturo Borrero Gonzalez: openstack: neutron: l3_agent: increase conntrack table size [puppet] - 10https://gerrit.wikimedia.org/r/506379 (https://phabricator.wikimedia.org/T221760) [10:59:39] (03CR) 10jerkins-bot: [V: 04-1] openstack: neutron: l3_agent: increase conntrack table size [puppet] - 10https://gerrit.wikimedia.org/r/506379 (https://phabricator.wikimedia.org/T221760) (owner: 10Arturo Borrero Gonzalez) [10:59:50] hi all [10:59:57] was I online or not? [11:00:03] I didn’t have my IRC client open, but it kept sending notifications [11:00:04] hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I, the Bot under the Fountain, allow thee, The Deployer, to do European Mid-day SWAT(Max 6 patches) deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190425T1100). [11:00:04] Lucas_WMDE and Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:10] and it looks like I was visible for y’all? [11:00:13] anyways, o/ [11:00:16] and I can deploy the changes [11:00:24] Urbanecm: still around? [11:00:43] for a sec [11:00:47] at least [11:00:50] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Nice! A comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/506367 (https://phabricator.wikimedia.org/T221101) (owner: 10Fsero) [11:01:00] OK let’s try your change first then [11:01:25] oh, there’s more changes on the calendar [11:01:25] 6 % of battery, unsure when it will auto-turn off [11:01:36] the Wikimania one looks less urgent [11:01:58] any final comment on the namespace number? [11:02:04] wdym? [11:02:04] (03PS3) 10Lucas Werkmeister (WMDE): Create new namespace "Edice" for cswikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506134 (https://phabricator.wikimedia.org/T221697) (owner: 10Urbanecm) [11:02:11] I had another question on the patch [11:02:19] if there’s any semi-standard number for an Edition namespace [11:02:26] but if there isn’t one we can just merge this, I think [11:03:17] (03CR) 10Vgutierrez: "pcc is happy: https://puppet-compiler.wmflabs.org/compiler1002/16019/" [puppet] - 10https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217) (owner: 10Vgutierrez) [11:03:18] okay, let’s merge [11:03:24] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506134 (https://phabricator.wikimedia.org/T221697) (owner: 10Urbanecm) [11:04:26] (03Merged) 10jenkins-bot: Create new namespace "Edice" for cswikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506134 
(https://phabricator.wikimedia.org/T221697) (owner: 10Urbanecm) [11:04:37] Okay, notebook is out of energy [11:04:41] Here from mobile [11:05:13] change is on mwdebug1002 [11:05:20] you probably can’t test it very well on mobile? [11:05:50] (03CR) 10jenkins-bot: Create new namespace "Edice" for cswikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506134 (https://phabricator.wikimedia.org/T221697) (owner: 10Urbanecm) [11:06:22] Not at all [11:06:23] I'm sorry [11:06:44] I’m testing getting namespaces|namespacealiases via the API [11:06:52] if they show up there, that should be enough, right? [11:06:57] Yes [11:06:58] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC as expected https://puppet-compiler.wmflabs.org/compiler1002/16020/" [puppet] - 10https://gerrit.wikimedia.org/r/506379 (https://phabricator.wikimedia.org/T221760) (owner: 10Arturo Borrero Gonzalez) [11:06:58] (server is being very slow unfortunately) [11:07:12] Another way is to visit Edice:a [11:07:37] And check if discussion is "diskuse:edice:a" or "diskuse k edici:a" [11:07:51] First one is Talk:, second one Edition talk [11:07:55] the talk redlink points to "Diskuse k edici:A" [11:07:59] so that looks correct [11:08:06] That should be good [11:08:18] ok yay [11:08:21] good enough for me, deploying [11:08:28] Thx [11:10:13] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:506134|Create new namespace "Edice" for cswikisource (T221697)]] (duration: 00m 54s) [11:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:19] T221697: Create new namespace "Edice" for cs.wikisource - https://phabricator.wikimedia.org/T221697 [11:10:31] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript maintenance/namespaceDupes.php --wiki=cswikisource --fix [11:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:44] okay, nothing to fix, yay [11:11:09] Urbanecm: how important is the logo 
change? I’m not really comfortable with it yet [11:11:18] I don’t know where that logo is used and if the size change is okay [11:11:30] btw Pikne are you there? I think jouncebot didn’t ping you [11:11:40] hi [11:11:46] If you browse to wikimania.wikimedia.org, the logo on the left is what I changed [11:12:25] I have one to deploy but haven't yet put it on the deployment schedule. [11:12:30] meanwhile, I’ve +2ed my backports [11:12:35] they’ll probably take half an hour in CI anyways :/ [11:12:47] Since wikimania is coming, I'd rather have it deployed today than on Monday next week, but I don't want to force you to do something you're not comfortable with [11:13:17] Urbanecm: https://wikimania.wikimedia.org/wiki/About still seems to show the old logo [11:13:27] You have to purge the url [11:13:29] ah [11:13:34] Let me try to search wikitech on mobile [11:13:46] no, I mean in the article content itself [11:13:55] above the Wikimedia Foundation logo [11:14:07] floating to the right of the “What is Wikimania” paragraph [11:15:06] okay I have too many questions now, I’m not deploying the logo change, sorry [11:15:22] https://wikitech.wikimedia.org/wiki/Multicast_HTCP_purging [11:15:23] Section one off purge [11:15:24] What do you mean, in content? [11:15:26] Okay [11:17:04] Pikne: do you have any estimate how long CI will take on your backport if I +2 it?
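The namespace check mentioned earlier in the SWAT window (fetching namespaces|namespacealiases via the API) can be sketched roughly as follows. This is only an illustration: the endpoint URL, the helper names and the sample response with its namespace IDs are assumptions and do not come from the log. Only the `meta=siteinfo` / `siprop` parameters are standard MediaWiki API.

```python
from urllib.parse import urlencode

# Assumed public endpoint for cswikisource; during SWAT the request would be
# routed to the debug server instead, which is not shown here.
API_URL = "https://cs.wikisource.org/w/api.php"


def siteinfo_url(api_url=API_URL):
    """Build a siteinfo query asking for namespaces and their aliases."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "namespaces|namespacealiases",
        "format": "json",
        "formatversion": "2",
    }
    return api_url + "?" + urlencode(params)


def find_namespace(siteinfo, name):
    """Return the id of the namespace called `name`, or None if absent."""
    for ns in siteinfo["query"]["namespaces"].values():
        # formatversion=2 uses "name"; the legacy format uses "*".
        if ns.get("name", ns.get("*")) == name:
            return ns["id"]
    return None


# Hypothetical response fragment; the IDs are placeholders, not the real ones.
SAMPLE_RESPONSE = {
    "query": {
        "namespaces": {
            "0": {"id": 0, "name": ""},
            "9000": {"id": 9000, "name": "Edice"},
            "9001": {"id": 9001, "name": "Diskuse k edici"},
        }
    }
}

edice_id = find_namespace(SAMPLE_RESPONSE, "Edice")
missing_id = find_namespace(SAMPLE_RESPONSE, "Aldono")
```

If the new namespace and its talk namespace both show up in the response, that matches the manual check of the talk redlink done in the channel.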
[11:17:30] no, sorry [11:17:54] Tulsi: please add your change to the deployments calendar, I might be able to deploy it while waiting for CI on the backports :) [11:17:57] looks like a nice and simple config change [11:18:01] RECOVERY - HP RAID on ms-be2034 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK [11:18:49] Pikne: okay, I’ve +2ed it, let’s see what happens… [11:18:54] * Lucas_WMDE_ watches the Zuul dashboard [11:19:13] (03PS1) 10Jbond: ulogd: fix mtail rules [puppet] - 10https://gerrit.wikimedia.org/r/506384 [11:19:55] (03CR) 10Jbond: [C: 03+2] ulogd: fix mtail rules [puppet] - 10https://gerrit.wikimedia.org/r/506384 (owner: 10Jbond) [11:21:50] (03CR) 10Lucas Werkmeister (WMDE): Add namespace "Aldono" at eo.wiktionary (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505765 (https://phabricator.wikimedia.org/T221525) (owner: 10Tulsi Bhagat) [11:22:15] Pikne: oh, your backports have to wait for mine anyways due to the way gate-and-submit works [11:22:28] :/ [11:25:56] (03CR) 10Lucas Werkmeister (WMDE): "I don’t see this logo used on wikimaniawiki page content yet (e. g. About [1] still links to the old logo). Is there a source for this bei" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506374 (https://phabricator.wikimedia.org/T221829) (owner: 10Urbanecm) [11:26:07] Urbanecm: I left my questions on the Gerrit change [11:26:23] I don't really know how to estimate this, but I see that merging this to master branch took 13 min. So maybe it's around the same for this, after yours are done?
[11:27:13] I was hoping it would get done before mine [11:27:24] but if it has to wait for them anyways it doesn’t really make a difference, I think [11:29:31] yup, done already (after 10min apparently) [11:29:38] just sitting in the queue waiting for my changes now [11:30:24] Lucas_WMDE, you pinged me, but i missed the message [11:30:29] Can you write it again, please? [11:30:37] Urbanecm: I left my questions on the Gerrit change [11:30:51] Ok. Will have a look later. [11:30:56] ok [11:37:52] Lucas_WMDE_: your first backport is on mwdebug1002, please test [11:39:12] (even though all three backports were already merged, I’m rebasing and scaping them one by one) [11:39:25] (though this does mean more waiting time on mwdebug1002 ☹) [11:40:49] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:41:21] looking good, the feature still works even though I disabled the beta feature [11:41:25] deploying my first backport [11:41:59] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 74283 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:42:54] ping Tulsi again – there *might* still be enough time to deploy your config change if you add it to the deployment calendar [11:43:00] (and respond to my question on Gerrit) [11:43:07] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.34.0-wmf.1/extensions/WikibaseQualityConstraints: SWAT: [[gerrit:505763|Enable constraint suggestions for everyone (T220609)]] (duration: 00m 59s) [11:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:12] T220609: Enable constraints suggestions for everyone and remove beta feature - https://phabricator.wikimedia.org/T220609 [11:43:13] okay, next backport [11:44:11] Lucas_WMDE_: the second backport is now on mwdebug1002, please test [11:44:39] yup, beta feature seems to be gone [11:45:07] 
and feature is still working, yay [11:45:09] deploying [11:45:56] Pikne: your backport is next, will you be able to test it? [11:45:59] (not yet, I’ll ping you again) [11:46:17] I hope so, yes. [11:46:24] ok great [11:46:36] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.34.0-wmf.1/extensions/WikibaseQualityConstraints: SWAT: [[gerrit:505764|Remove beta feature for constraint suggestions (T220609)]] (duration: 00m 56s) [11:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:50] Pikne: okay, your change should be on mwdebug1002 now, please test [11:48:21] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:49:10] (03CR) 10Effie Mouzeli: registryha: feat: introducing LVS configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/506367 (https://phabricator.wikimedia.org/T221101) (owner: 10Fsero) [11:49:31] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 74297 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:49:50] hmm mwdebug1002 is flapping a bit [11:50:29] yeah earlier some hhvm was at 100% CPU [11:50:32] seems to be okay right now [11:50:53] Hm, I think that purging a page (or null edit) on page with a map should give the desired result, but it doesn't fix the problem. Can I purge a page cache on mwdebug1002? [11:51:11] not sure [11:51:20] copy the wikitext into a new userpage and try it there, perhaps? [11:51:32] (if that’s possible) [11:52:46] Right, it works this way. Should be good then. 
[11:52:51] ok great [11:52:53] deploying [11:52:55] thanks [11:53:47] (03PS1) 10Ema: cache: move varnish etcd-based directors to profile [puppet] - 10https://gerrit.wikimedia.org/r/506389 (https://phabricator.wikimedia.org/T219967) [11:54:35] (03CR) 10jerkins-bot: [V: 04-1] cache: move varnish etcd-based directors to profile [puppet] - 10https://gerrit.wikimedia.org/r/506389 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [11:54:42] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.34.0-wmf.1/extensions/Kartographer/: SWAT: [[gerrit:506363|Support data-mw="interface" also in staticframe (T221439)]] (duration: 00m 54s) [11:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:47] T221439: Static mapframe not clickable in 1.34.0-wmf.1 - https://phabricator.wikimedia.org/T221439 [11:54:59] alright, looks like that’s everything [11:55:03] pretty much right on time, too [11:55:14] !log EU SWAT done [11:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:50] Thanks! 
[11:57:13] (03PS27) 10Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - 10https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217) [11:57:15] (03PS13) 10Vgutierrez: trafficserver: Provide support for inbound TLS traffic [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) [11:57:17] (03PS1) 10Vgutierrez: trafficserver: Allow disabling caching requests [puppet] - 10https://gerrit.wikimedia.org/r/506390 (https://phabricator.wikimedia.org/T221594) [11:57:53] (03CR) 10jerkins-bot: [V: 04-1] trafficserver: Allow disabling caching requests [puppet] - 10https://gerrit.wikimedia.org/r/506390 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [11:58:49] (03CR) 10Effie Mouzeli: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/506367 (https://phabricator.wikimedia.org/T221101) (owner: 10Fsero) [12:00:25] (03PS2) 10Vgutierrez: trafficserver: Allow disabling caching requests [puppet] - 10https://gerrit.wikimedia.org/r/506390 (https://phabricator.wikimedia.org/T221594) [12:02:53] (03PS3) 10Vgutierrez: trafficserver: Allow disabling caching requests [puppet] - 10https://gerrit.wikimedia.org/r/506390 (https://phabricator.wikimedia.org/T221594) [12:08:09] (03PS28) 10Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - 10https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217) [12:08:11] (03PS14) 10Vgutierrez: trafficserver: Provide support for inbound TLS traffic [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) [12:08:13] (03PS4) 10Vgutierrez: trafficserver: Allow disabling caching requests [puppet] - 10https://gerrit.wikimedia.org/r/506390 (https://phabricator.wikimedia.org/T221594) [12:09:08] (03PS2) 10Ema: cache: move varnish etcd-based directors to profile [puppet] - 10https://gerrit.wikimedia.org/r/506389 (https://phabricator.wikimedia.org/T219967) 
[12:11:20] (03CR) 10Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/16024/" [puppet] - 10https://gerrit.wikimedia.org/r/506390 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [12:12:05] PROBLEM - puppet last run on kafkamon1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:12:41] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Performance-Team: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848 (10Gilles) [12:17:34] (03PS1) 10Vgutierrez: trafficserver: Provide a TLS terminator profile [puppet] - 10https://gerrit.wikimedia.org/r/506398 [12:18:29] (03CR) 10jerkins-bot: [V: 04-1] trafficserver: Provide a TLS terminator profile [puppet] - 10https://gerrit.wikimedia.org/r/506398 (owner: 10Vgutierrez) [12:20:58] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Performance-Team: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848 (10MoritzMuehlenhoff) We can probably simply backport https://github.com/dpkp/kafka-python/pull/1628/commits/f12d4978e06c191871e092c190c2a34977f0c8bd on top of our 1.4.3... [12:22:30] (03PS2) 10Vgutierrez: trafficserver: Provide a TLS terminator profile [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [12:23:03] (03CR) 10jerkins-bot: [V: 04-1] trafficserver: Provide a TLS terminator profile [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [12:31:26] (03PS3) 10Ema: cache: move varnish etcd-based directors to profile [puppet] - 10https://gerrit.wikimedia.org/r/506389 (https://phabricator.wikimedia.org/T219967) [12:32:53] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Performance-Team: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848 (10Gilles) a:03Ottomata Why not just upgrade to a newer version? Surely it's not the only bugfix in the past year. 
I've packaged the latest stable (1.4.6) without iss... [12:35:27] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506399 [12:36:41] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506399 (owner: 10Marostegui) [12:37:42] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506399 (owner: 10Marostegui) [12:38:29] RECOVERY - puppet last run on kafkamon1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:39:38] (03PS4) 10Ema: cache: move varnish etcd-based directors to profile [puppet] - 10https://gerrit.wikimedia.org/r/506389 (https://phabricator.wikimedia.org/T219967) [12:39:52] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1103:3314 T221782 (duration: 00m 53s) [12:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:58] T221782: Fix revision special indexes and partitions on db1103:3314 and db1113:3316 - https://phabricator.wikimedia.org/T221782 [12:39:58] (03PS1) 10Jbond: logstash: add ulog parser to logstash [puppet] - 10https://gerrit.wikimedia.org/r/506400 (https://phabricator.wikimedia.org/T220987) [12:40:27] (03CR) 10jerkins-bot: [V: 04-1] cache: move varnish etcd-based directors to profile [puppet] - 10https://gerrit.wikimedia.org/r/506389 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [12:40:55] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506399 (owner: 10Marostegui) [12:43:04] (03PS5) 10Ema: cache: move varnish etcd-based directors to profile [puppet] - 10https://gerrit.wikimedia.org/r/506389 (https://phabricator.wikimedia.org/T219967) [12:43:39] PROBLEM - puppet last run on cloudnet1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:43:58] (03PS2) 10Jbond: RAID: replace hpssacli with sscli [puppet] - 10https://gerrit.wikimedia.org/r/505760 (https://phabricator.wikimedia.org/T220787) [12:44:13] (03Abandoned) 10Jbond: RAID: replace hpssacli with sscli [puppet] - 10https://gerrit.wikimedia.org/r/504586 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [12:44:54] (03Abandoned) 10Jbond: ulogd logstash: Add rule to parse ulogd ouput to json [puppet] - 10https://gerrit.wikimedia.org/r/505783 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond) [12:45:19] (03CR) 10Hashar: "Seems that python2 urllib on Jessie works fine, so at least Zuul would still be able to reach Gerrit over https." [puppet] - 10https://gerrit.wikimedia.org/r/505410 (https://phabricator.wikimedia.org/T221499) (owner: 10Alex Monk) [12:45:22] (03PS1) 10Marostegui: db-eqiad.php: Depool db1113:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506401 (https://phabricator.wikimedia.org/T221782) [12:45:51] (03PS6) 10Ema: cache: move varnish etcd-based directors to profile [puppet] - 10https://gerrit.wikimedia.org/r/506389 (https://phabricator.wikimedia.org/T219967) [12:46:07] ? 
[12:46:34] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1113:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506401 (https://phabricator.wikimedia.org/T221782) (owner: 10Marostegui) [12:47:18] cloudnet1003 is fine, just checked [12:47:33] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1113:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506401 (https://phabricator.wikimedia.org/T221782) (owner: 10Marostegui) [12:48:45] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1113:3316 T221782 (duration: 00m 53s) [12:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:50] T221782: Fix revision special indexes and partitions on db1103:3314 and db1113:3316 - https://phabricator.wikimedia.org/T221782 [12:48:57] RECOVERY - puppet last run on cloudnet1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:49:34] (03PS3) 10Jbond: Canary roles: create a canary role for the aqs canary server [puppet] - 10https://gerrit.wikimedia.org/r/504538 (https://phabricator.wikimedia.org/T221226) [12:50:31] (03CR) 10Jbond: [C: 03+2] Canary roles: create a canary role for the aqs canary server [puppet] - 10https://gerrit.wikimedia.org/r/504538 (https://phabricator.wikimedia.org/T221226) (owner: 10Jbond) [12:52:37] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1113:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506401 (https://phabricator.wikimedia.org/T221782) (owner: 10Marostegui) [12:56:22] (03PS7) 10Ema: cache: move varnish etcd-based directors to profile [puppet] - 10https://gerrit.wikimedia.org/r/506389 (https://phabricator.wikimedia.org/T219967) [12:59:13] PROBLEM - puppet last run on aqs1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:04:14] (03PS1) 10Jbond: Revert "Canary roles: create a canary role for the aqs canary server" [puppet] - 10https://gerrit.wikimedia.org/r/506403 [13:05:05] (03CR) 10jerkins-bot: [V: 04-1] Revert "Canary roles: create a canary role for the aqs canary server" [puppet] - 10https://gerrit.wikimedia.org/r/506403 (owner: 10Jbond) [13:08:00] (03PS2) 10Jbond: Revert "Canary roles: create a canary role for the aqs canary server" [puppet] - 10https://gerrit.wikimedia.org/r/506403 [13:08:15] (03PS2) 10Mathew.onipe: elasticsearch: config file for aligning puppet config [puppet] - 10https://gerrit.wikimedia.org/r/506378 (https://phabricator.wikimedia.org/T218932) [13:09:01] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: config file for aligning puppet config [puppet] - 10https://gerrit.wikimedia.org/r/506378 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe) [13:09:15] (03CR) 10Jbond: [C: 03+2] Revert "Canary roles: create a canary role for the aqs canary server" [puppet] - 10https://gerrit.wikimedia.org/r/506403 (owner: 10Jbond) [13:12:31] (03PS3) 10Muehlenhoff: Update aliases [puppet] - 10https://gerrit.wikimedia.org/r/506355 (https://phabricator.wikimedia.org/T221125) [13:14:12] (03Abandoned) 10Effie Mouzeli: thumbor: refer prometheus.lua from updated location [puppet] - 10https://gerrit.wikimedia.org/r/492720 (https://phabricator.wikimedia.org/T216681) (owner: 10Mathew.onipe) [13:15:05] RECOVERY - puppet last run on aqs1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:15:25] (03PS3) 10Mathew.onipe: elasticsearch: config file for aligning puppet config [puppet] - 10https://gerrit.wikimedia.org/r/506378 (https://phabricator.wikimedia.org/T218932) [13:19:26] (03CR) 10Gehel: "@Volans: after discussion with Moritz, he seems to think that the current implementation is good enough in our context. 
Do you have someth" [puppet] - 10https://gerrit.wikimedia.org/r/502829 (https://phabricator.wikimedia.org/T220625) (owner: 10Gehel) [13:24:48] 10Operations, 10Discovery-Search, 10Wikidata, 10Wikidata-Query-Service: Create Cookbook to restart WDQS - https://phabricator.wikimedia.org/T221832 (10Lucas_Werkmeister_WMDE) (side note – you might as well combine the two restarts without sleep between them into a single `systemctl restart wdqs-blazegraph... [13:34:26] (03PS1) 10Ema: varnish: add reload_vcl_opts function [puppet] - 10https://gerrit.wikimedia.org/r/506409 (https://phabricator.wikimedia.org/T219967) [13:34:33] 10Operations, 10MobileFrontend, 10TechCom-RFC, 10Traffic, 10Readers-Web-Backlog (Tracking): Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (10Izno) [13:34:57] (03CR) 10jerkins-bot: [V: 04-1] varnish: add reload_vcl_opts function [puppet] - 10https://gerrit.wikimedia.org/r/506409 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [13:35:29] (03CR) 10Gehel: "Minor comment in line and needs to be actually tested, but LGTM" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [13:35:59] why can I never ever find what's wrong with my patches in jerkins' output? 
[13:36:20] (03CR) 10Gehel: Add maps postgres init cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [13:37:38] (03PS2) 10Ema: varnish: add reload_vcl_opts function [puppet] - 10https://gerrit.wikimedia.org/r/506409 (https://phabricator.wikimedia.org/T219967) [13:39:19] (03CR) 10Effie Mouzeli: [V: 03+1] "Changes look ok https://puppet-compiler.wmflabs.org/compiler1002/16033/" [puppet] - 10https://gerrit.wikimedia.org/r/506366 (https://phabricator.wikimedia.org/T213195) (owner: 10Effie Mouzeli) [13:40:51] (03PS3) 10Ema: varnish: add reload_vcl_opts function [puppet] - 10https://gerrit.wikimedia.org/r/506409 (https://phabricator.wikimedia.org/T219967) [13:45:17] (03CR) 10Ema: "noop https://puppet-compiler.wmflabs.org/compiler1002/16035/" [puppet] - 10https://gerrit.wikimedia.org/r/506409 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [13:47:55] (03CR) 10Ema: [C: 03+2] varnish: add reload_vcl_opts function [puppet] - 10https://gerrit.wikimedia.org/r/506409 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [13:48:41] about to spam this channel with several commits (all of which jenkins will dislike), but I want them in gerrit so they are visible someplace [13:48:53] (03PS1) 10ArielGlenn: move from MySQLdb to pymysql [software] - 10https://gerrit.wikimedia.org/r/506410 [13:48:56] (03PS1) 10ArielGlenn: allow table checking to work with specified section [software] - 10https://gerrit.wikimedia.org/r/506411 [13:48:58] (03PS1) 10ArielGlenn: for checking tables per section, do only so many wikis, not all [software] - 10https://gerrit.wikimedia.org/r/506412 [13:49:00] (03PS1) 10ArielGlenn: ability to check tables on default section, by looking up databases served [software] - 10https://gerrit.wikimedia.org/r/506413 [13:49:02] (03PS1) 10ArielGlenn: show host info will now show the largest n wikis per requested section [software] - 
10https://gerrit.wikimedia.org/r/506414 [13:49:06] spam done for today thank you [13:49:10] :) [13:49:49] (03PS1) 10Marostegui: site.pp: Remove pc1004-pc1006 [puppet] - 10https://gerrit.wikimedia.org/r/506415 (https://phabricator.wikimedia.org/T210969) [13:50:00] (03CR) 10jerkins-bot: [V: 04-1] allow table checking to work with specified section [software] - 10https://gerrit.wikimedia.org/r/506411 (owner: 10ArielGlenn) [13:50:05] (03CR) 10jerkins-bot: [V: 04-1] move from MySQLdb to pymysql [software] - 10https://gerrit.wikimedia.org/r/506410 (owner: 10ArielGlenn) [13:50:07] (03CR) 10jerkins-bot: [V: 04-1] for checking tables per section, do only so many wikis, not all [software] - 10https://gerrit.wikimedia.org/r/506412 (owner: 10ArielGlenn) [13:50:09] (03CR) 10jerkins-bot: [V: 04-1] ability to check tables on default section, by looking up databases served [software] - 10https://gerrit.wikimedia.org/r/506413 (owner: 10ArielGlenn) [13:50:19] (03CR) 10jerkins-bot: [V: 04-1] show host info will now show the largest n wikis per requested section [software] - 10https://gerrit.wikimedia.org/r/506414 (owner: 10ArielGlenn) [13:50:31] (03PS2) 10Marostegui: site.pp: Remove pc1004-pc1006 [puppet] - 10https://gerrit.wikimedia.org/r/506415 (https://phabricator.wikimedia.org/T210969) [13:51:34] 10Operations, 10ops-eqiad, 10DBA, 10decommission, 10Patch-For-Review: Decommission parsercache hosts: pc1004.eqiad.wmnet pc1005.eqiad.wmnet pc1006.eqiad.wmnet - https://phabricator.wikimedia.org/T210969 (10Marostegui) @RobH @Cmjohnson there are also entries on site.pp, I have sent a patch for that: https... 
[13:52:46] (03PS3) 10Vgutierrez: trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [13:53:41] (03CR) 10jerkins-bot: [V: 04-1] trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [13:55:16] (03PS1) 10Marostegui: db2033.yaml: Remove file [puppet] - 10https://gerrit.wikimedia.org/r/506416 (https://phabricator.wikimedia.org/T220070) [13:55:30] (03PS2) 10Marostegui: db2033.yaml: Remove file [puppet] - 10https://gerrit.wikimedia.org/r/506416 (https://phabricator.wikimedia.org/T220070) [14:00:55] cdanis: I'm about to decom the host 'labcontrol1002' — that will break https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/478225/ but I don't really understand what that's doing. Can I just remove those lines? [14:01:54] andrewbogott: if you're referring to the line referencing it in kernel.test, that's just sample syslog output to be parsed and checked by the unittest. 
decomming the host should not affect that [14:02:33] it's totally fine if hostnames referenced in the sample data don't exist [14:03:00] huh, ok then :) [14:03:07] I will ignore [14:03:54] 'kernel.test' is just syslog-formatted output to be parsed by 'kernel.mtail' and then have the results checked by 'kernel_test.py' [14:04:12] (03PS8) 10Ema: cache: move varnish etcd-based directors to profile [puppet] - 10https://gerrit.wikimedia.org/r/506389 (https://phabricator.wikimedia.org/T219967) [14:04:13] mtail is just a utility for parsing logs on the fly and recording counters about events you specified in its DSL [14:05:16] (03PS4) 10Vgutierrez: trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [14:06:11] cdanis: "just" [14:06:15] :) [14:06:37] haha :) [14:10:13] 10Operations, 10Traffic, 10Wikidata, 10serviceops, and 4 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (10BBlack) Re: `wikibase.org`, adding it as a non-canonical redirection to catch confusion from those that manually type URLs is fine, but we sho... 
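The mtail explanation above (parse log lines on the fly, keep counters for the events you describe) can be illustrated with a small Python sketch. This is not mtail's DSL and not the contents of 'kernel.mtail' or 'kernel.test': the rule names, the patterns and the sample syslog lines are all invented for illustration. It also shows why a hostname in sample data does not need to exist: the rules never look at it.

```python
import re
from collections import Counter

# Hypothetical rules: counter name -> pattern to look for in each log line.
RULES = {
    "oom_kills": re.compile(r"Out of memory: Kill process"),
    "segfaults": re.compile(r"segfault at"),
}

# Invented syslog-style sample lines; the hostname is just text to the parser.
SAMPLE_LINES = [
    "Apr 25 14:00:01 labcontrol1002 kernel: [1.0] Out of memory: Kill process 4242 (php)",
    "Apr 25 14:00:02 labcontrol1002 kernel: [2.0] php[4243]: segfault at 0 ip 0000 sp 0000",
    "Apr 25 14:00:03 labcontrol1002 kernel: [3.0] eth0: link up",
    "Apr 25 14:00:04 labcontrol1002 kernel: [4.0] Out of memory: Kill process 4250 (hhvm)",
]


def count_events(lines, rules=RULES):
    """Increment one counter per rule for every line the rule matches."""
    counters = Counter()
    for line in lines:
        for name, pattern in rules.items():
            if pattern.search(line):
                counters[name] += 1
    return counters


counters = count_events(SAMPLE_LINES)
```

A unit test in the style described in the channel would feed the sample file through the rules and compare the resulting counters with expected values, which works the same whether or not the host in the sample still exists.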
[14:10:56] (03PS5) 10Vgutierrez: trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [14:11:21] (03PS4) 10Muehlenhoff: Update aliases [puppet] - 10https://gerrit.wikimedia.org/r/506355 (https://phabricator.wikimedia.org/T221125) [14:13:55] 10Operations, 10decommission: Decommission labservices1001, 1002 - https://phabricator.wikimedia.org/T221857 (10Andrew) [14:14:17] (03CR) 10Muehlenhoff: [C: 03+2] Update aliases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/506355 (https://phabricator.wikimedia.org/T221125) (owner: 10Muehlenhoff) [14:14:33] (03PS5) 10Vgutierrez: trafficserver: Allow disabling caching requests [puppet] - 10https://gerrit.wikimedia.org/r/506390 (https://phabricator.wikimedia.org/T221594) [14:14:35] (03PS6) 10Vgutierrez: trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [14:14:45] (03PS1) 10Andrew Bogott: Move labservices1001/1002 to role::spare and clean up [puppet] - 10https://gerrit.wikimedia.org/r/506428 (https://phabricator.wikimedia.org/T221857) [14:14:59] 10Operations, 10Operations-Software-Development, 10cloud-services-team, 10Patch-For-Review: cumin aliases not matching any hosts - https://phabricator.wikimedia.org/T221125 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff [14:15:19] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): cumin: leaked aliases - https://phabricator.wikimedia.org/T221788 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff [14:16:15] (03CR) 10Andrew Bogott: [C: 04-2] "This must not be merged until all traffic is off of the DNS servers here. That will be true by 4:00PM my time today." 
[puppet] - 10https://gerrit.wikimedia.org/r/506428 (https://phabricator.wikimedia.org/T221857) (owner: 10Andrew Bogott) [14:16:52] (03PS9) 10Ema: cache: move varnish etcd-based directors to profile [puppet] - 10https://gerrit.wikimedia.org/r/506389 (https://phabricator.wikimedia.org/T219967) [14:17:08] (03PS7) 10Vgutierrez: trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [14:19:17] (03PS2) 10Andrew Bogott: Move labservices1001/1002 to role::spare and clean up [puppet] - 10https://gerrit.wikimedia.org/r/506428 (https://phabricator.wikimedia.org/T221857) [14:19:26] (03PS2) 10Andrew Bogott: Remove a bunch of old Horizon code [puppet] - 10https://gerrit.wikimedia.org/r/506344 [14:20:39] (03CR) 10Andrew Bogott: [C: 03+2] Remove a bunch of old Horizon code [puppet] - 10https://gerrit.wikimedia.org/r/506344 (owner: 10Andrew Bogott) [14:21:05] (03PS6) 10Vgutierrez: trafficserver: Allow disabling caching requests [puppet] - 10https://gerrit.wikimedia.org/r/506390 (https://phabricator.wikimedia.org/T221594) [14:21:07] (03PS8) 10Vgutierrez: trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [14:21:14] (03PS5) 10Fsero: registryha: feat: introducing LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/506367 (https://phabricator.wikimedia.org/T221101) [14:22:37] 10Operations, 10ops-codfw, 10decommission, 10fundraising-tech-ops, 10Patch-For-Review: decommission betelgeuse.frack.codfw.wmnet - https://phabricator.wikimedia.org/T206870 (10Papaul) [14:23:00] (03PS10) 10Ema: cache: move varnish etcd-based directors to profile [puppet] - 10https://gerrit.wikimedia.org/r/506389 (https://phabricator.wikimedia.org/T219967) [14:23:40] (03PS7) 10Vgutierrez: trafficserver: Allow disabling caching requests [puppet] - 10https://gerrit.wikimedia.org/r/506390 
(https://phabricator.wikimedia.org/T221594) [14:23:42] (03PS9) 10Vgutierrez: trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [14:28:29] (03PS29) 10Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - 10https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217) [14:28:31] (03PS15) 10Vgutierrez: trafficserver: Provide support for inbound TLS traffic [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) [14:28:33] (03PS8) 10Vgutierrez: trafficserver: Allow disabling caching requests [puppet] - 10https://gerrit.wikimedia.org/r/506390 (https://phabricator.wikimedia.org/T221594) [14:28:36] (03PS10) 10Vgutierrez: trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [14:28:41] (03PS1) 10Arturo Borrero Gonzalez: sssd: sudo: don't install sudo-ldap if using sssd [puppet] - 10https://gerrit.wikimedia.org/r/506435 (https://phabricator.wikimedia.org/T221225) [14:30:11] (03CR) 10jerkins-bot: [V: 04-1] sssd: sudo: don't install sudo-ldap if using sssd [puppet] - 10https://gerrit.wikimedia.org/r/506435 (https://phabricator.wikimedia.org/T221225) (owner: 10Arturo Borrero Gonzalez) [14:31:52] (03PS1) 10Andrew Bogott: Remove delegation of 208.80.155.128-255 [dns] - 10https://gerrit.wikimedia.org/r/506436 (https://phabricator.wikimedia.org/T221183) [14:32:00] (03PS30) 10Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - 10https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217) [14:32:02] (03PS16) 10Vgutierrez: trafficserver: Provide support for inbound TLS traffic [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) [14:32:04] (03PS9) 10Vgutierrez: trafficserver: Allow 
disabling caching requests [puppet] - 10https://gerrit.wikimedia.org/r/506390 (https://phabricator.wikimedia.org/T221594) [14:32:05] (03PS11) 10Vgutierrez: trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [14:32:30] (03PS1) 10Jbond: canary host: test method for having canary hosts [puppet] - 10https://gerrit.wikimedia.org/r/506437 (https://phabricator.wikimedia.org/T219803) [14:33:22] (03PS11) 10Ema: cache: move varnish etcd-based directors to profile [puppet] - 10https://gerrit.wikimedia.org/r/506389 (https://phabricator.wikimedia.org/T219967) [14:33:32] (03PS2) 10Arturo Borrero Gonzalez: sssd: sudo: don't install sudo-ldap if using sssd [puppet] - 10https://gerrit.wikimedia.org/r/506435 (https://phabricator.wikimedia.org/T221225) [14:33:38] (03CR) 10Ema: "pcc looks good https://puppet-compiler.wmflabs.org/compiler1002/16044/" [puppet] - 10https://gerrit.wikimedia.org/r/506389 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [14:34:13] (03PS2) 10Jbond: canary host: test method for having canary hosts [puppet] - 10https://gerrit.wikimedia.org/r/506437 (https://phabricator.wikimedia.org/T219803) [14:34:22] (03CR) 10jerkins-bot: [V: 04-1] sssd: sudo: don't install sudo-ldap if using sssd [puppet] - 10https://gerrit.wikimedia.org/r/506435 (https://phabricator.wikimedia.org/T221225) (owner: 10Arturo Borrero Gonzalez) [14:34:36] (03CR) 10Ema: [C: 03+2] cache: move varnish etcd-based directors to profile [puppet] - 10https://gerrit.wikimedia.org/r/506389 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [14:35:14] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/506437 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [14:36:37] (03CR) 10Alex Monk: "I made Ie45778734de59f73295d7a366f43f4fd2504bb0b for this" [dns] - 10https://gerrit.wikimedia.org/r/506436 (https://phabricator.wikimedia.org/T221183) 
(owner: 10Andrew Bogott) [14:38:59] (03PS3) 10Arturo Borrero Gonzalez: sssd: sudo: don't install sudo-ldap if using sssd [puppet] - 10https://gerrit.wikimedia.org/r/506435 (https://phabricator.wikimedia.org/T221225) [14:39:39] (03Abandoned) 10Andrew Bogott: Remove delegation of 208.80.155.128-255 [dns] - 10https://gerrit.wikimedia.org/r/506436 (https://phabricator.wikimedia.org/T221183) (owner: 10Andrew Bogott) [14:40:13] (03PS4) 10Arturo Borrero Gonzalez: sssd: sudo: don't install sudo-ldap if using sssd [puppet] - 10https://gerrit.wikimedia.org/r/506435 (https://phabricator.wikimedia.org/T221225) [14:40:44] (03PS2) 10Andrew Bogott: Remove old labs 'main' region in-addr.arpa delegation [dns] - 10https://gerrit.wikimedia.org/r/505478 (https://phabricator.wikimedia.org/T221183) (owner: 10Alex Monk) [14:43:06] (03PS1) 10Papaul: DNS: Remove mgmt DNS for betelgeuse [dns] - 10https://gerrit.wikimedia.org/r/506443 [14:43:53] (03PS4) 10Mathew.onipe: Add postgres slave init cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) [14:44:40] (03CR) 10Mathew.onipe: Add postgres slave init cookbook (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [14:47:01] (03PS3) 10Andrew Bogott: Remove old labs 'main' region in-addr.arpa delegation [dns] - 10https://gerrit.wikimedia.org/r/505478 (https://phabricator.wikimedia.org/T221183) (owner: 10Alex Monk) [14:47:33] (03CR) 10Andrew Bogott: [C: 03+2] Remove old labs 'main' region in-addr.arpa delegation [dns] - 10https://gerrit.wikimedia.org/r/505478 (https://phabricator.wikimedia.org/T221183) (owner: 10Alex Monk) [14:47:53] (03PS1) 10Papaul: DNS: Remove mgmt and production DNS for rigel [dns] - 10https://gerrit.wikimedia.org/r/506444 [14:48:14] (03CR) 10Bstorm: [C: 03+2] cloudstore: add floating IP for the maps homes/project NFS [dns] - 10https://gerrit.wikimedia.org/r/506327 
(https://phabricator.wikimedia.org/T221806) (owner: 10Bstorm) [14:48:21] (03PS3) 10Bstorm: cloudstore: add floating IP for the maps homes/project NFS [dns] - 10https://gerrit.wikimedia.org/r/506327 (https://phabricator.wikimedia.org/T221806) [14:48:37] (03PS1) 10Ema: conftool-data: cp4021 only ats-be in production [puppet] - 10https://gerrit.wikimedia.org/r/506445 (https://phabricator.wikimedia.org/T219967) [14:50:01] (03PS2) 10Ema: conftool-data: set cp4021 as the only ats-be in production [puppet] - 10https://gerrit.wikimedia.org/r/506445 (https://phabricator.wikimedia.org/T219967) [14:50:11] (03PS5) 10Arturo Borrero Gonzalez: sssd: sudo: don't install sudo-ldap if using sssd [puppet] - 10https://gerrit.wikimedia.org/r/506435 (https://phabricator.wikimedia.org/T221225) [14:51:35] !log update backup grants for dbprov1* on source dbs [14:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:56] (03PS3) 10Jcrespo: mariadb-backups: Setup dbprov eqiad servers, remove dbstore1001 backups [puppet] - 10https://gerrit.wikimedia.org/r/506180 (https://phabricator.wikimedia.org/T219399) [14:52:46] 10Operations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Allocate VIP for failover of the maps home and project mounts on cloudstore1008/9 - https://phabricator.wikimedia.org/T221806 (10Bstorm) 05Open→03Resolved [14:53:00] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission db2033 - https://phabricator.wikimedia.org/T220070 (10Papaul) [14:53:46] (03CR) 10Fsero: "PCC seems happy https://puppet-compiler.wmflabs.org/compiler1002/16049/" [puppet] - 10https://gerrit.wikimedia.org/r/506367 (https://phabricator.wikimedia.org/T221101) (owner: 10Fsero) [14:54:50] (03PS6) 10Arturo Borrero Gonzalez: sssd: sudo: don't install sudo-ldap if using sssd [puppet] - 10https://gerrit.wikimedia.org/r/506435 (https://phabricator.wikimedia.org/T221225) [14:56:16] !log syncing facts for puppet compiler [14:56:20] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:51] (03CR) 10Andrew Bogott: sssd: sudo: don't install sudo-ldap if using sssd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/506435 (https://phabricator.wikimedia.org/T221225) (owner: 10Arturo Borrero Gonzalez) [14:57:16] (03PS1) 10Thcipriani: gerrit: raise changeid_project cache [puppet] - 10https://gerrit.wikimedia.org/r/506452 (https://phabricator.wikimedia.org/T221026) [15:00:51] (03PS6) 10Bstorm: cloudstore: refactor nfsclient role into profile [puppet] - 10https://gerrit.wikimedia.org/r/506319 (https://phabricator.wikimedia.org/T209527) [15:01:51] (03CR) 10Bstorm: [C: 03+2] cloudstore: refactor nfsclient role into profile [puppet] - 10https://gerrit.wikimedia.org/r/506319 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [15:03:07] (03PS12) 10Vgutierrez: trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [15:03:26] (03CR) 10CDanis: [C: 03+2] gerrit: raise changeid_project cache [puppet] - 10https://gerrit.wikimedia.org/r/506452 (https://phabricator.wikimedia.org/T221026) (owner: 10Thcipriani) [15:03:43] (03PS2) 10CDanis: gerrit: raise changeid_project cache [puppet] - 10https://gerrit.wikimedia.org/r/506452 (https://phabricator.wikimedia.org/T221026) (owner: 10Thcipriani) [15:06:38] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission db2014,db2020, db2021, db2022, db2024, db2031 - https://phabricator.wikimedia.org/T221424 (10Papaul) [15:07:58] !log gerrit restart to pickup new cache config changes [15:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:36] gerrit down [15:09:37] 10Operations, 10Traffic, 10Wikidata, 10serviceops, and 4 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (10WMDE-leszek) thanks for the write up @BBlack, I am going to take 
over the domain ownership topic from WMDE side, as it apparently has fallen t... [15:09:45] oh ^^^^ just saw the SAL [15:09:48] arturo: see log above [15:09:55] !log gerrit back [15:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:49] (03PS1) 10Papaul: DNS: Remove mgmt DNS for db2033 [dns] - 10https://gerrit.wikimedia.org/r/506466 [15:15:15] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [15:16:21] (03CR) 10Gehel: Add postgres slave init cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [15:18:39] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "https://puppet-compiler.wmflabs.org/compiler1002/16054/tools-sgebastion-07.tools.eqiad.wmflabs/change.tools-sgebastion-07.tools.eqiad.wmfl" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/506435 (https://phabricator.wikimedia.org/T221225) (owner: 10Arturo Borrero Gonzalez) [15:18:44] (03CR) 10Mathew.onipe: Add postgres slave init cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [15:19:42] (03Restored) 10Paladox: Gerrit: Enable g1 gc as we now use java 8 [puppet] - 10https://gerrit.wikimedia.org/r/327763 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [15:21:27] (03PS3) 10Jbond: canary host: test method for having canary hosts [puppet] - 10https://gerrit.wikimedia.org/r/506437 (https://phabricator.wikimedia.org/T219803) [15:21:37] (03PS4) 10Paladox: Gerrit: Enable g1 gc as we now use java 8 [puppet] - 10https://gerrit.wikimedia.org/r/327763 (https://phabricator.wikimedia.org/T148478) [15:22:28] (03PS5) 10Paladox: gerrit: Enable G1 GC [puppet] - 10https://gerrit.wikimedia.org/r/327763 (https://phabricator.wikimedia.org/T221026) 
[15:28:15] (03CR) 10Jcrespo: "> Probably worth a puppet compiler run to make sure everything works" [puppet] - 10https://gerrit.wikimedia.org/r/506180 (https://phabricator.wikimedia.org/T219399) (owner: 10Jcrespo) [15:29:19] (03PS4) 10Jbond: canary host: test method for having canary hosts [puppet] - 10https://gerrit.wikimedia.org/r/506437 (https://phabricator.wikimedia.org/T219803) [15:31:10] (03PS4) 10Jcrespo: mariadb-backups: Setup dbprov eqiad servers, remove dbstore1001 backups [puppet] - 10https://gerrit.wikimedia.org/r/506180 (https://phabricator.wikimedia.org/T219399) [15:32:08] (03CR) 10Jbond: [C: 03+2] canary host: test method for having canary hosts [puppet] - 10https://gerrit.wikimedia.org/r/506437 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [15:32:16] (03PS5) 10Jbond: canary host: test method for having canary hosts [puppet] - 10https://gerrit.wikimedia.org/r/506437 (https://phabricator.wikimedia.org/T219803) [15:33:14] (03PS5) 10Jcrespo: mariadb-backups: Setup dbprov eqiad servers, remove dbstore1001 backups [puppet] - 10https://gerrit.wikimedia.org/r/506180 (https://phabricator.wikimedia.org/T219399) [15:36:24] (03PS1) 10CDanis: codfw decom: halve weights again [software/swift-ring] - 10https://gerrit.wikimedia.org/r/506469 (https://phabricator.wikimedia.org/T221068) [15:39:04] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Setup dbprov eqiad servers, remove dbstore1001 backups [puppet] - 10https://gerrit.wikimedia.org/r/506180 (https://phabricator.wikimedia.org/T219399) (owner: 10Jcrespo) [15:41:30] (03CR) 10CDanis: [V: 03+2 C: 03+2] codfw decom: halve weights again [software/swift-ring] - 10https://gerrit.wikimedia.org/r/506469 (https://phabricator.wikimedia.org/T221068) (owner: 10CDanis) [15:41:41] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:43:45] (03PS1) 10Bstorm: cloudstore: in stretch, the location of default nsswitch is different [puppet] - 
10https://gerrit.wikimedia.org/r/506472 (https://phabricator.wikimedia.org/T209527) [15:45:03] (03PS2) 10Bstorm: cloudstore: in stretch, the location of default nsswitch is different [puppet] - 10https://gerrit.wikimedia.org/r/506472 (https://phabricator.wikimedia.org/T209527) [15:45:28] 10Operations, 10Data-Services, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): Decommission labsdb1006.eqiad.wmnet and labsdb1007.eqiad.wmnet - https://phabricator.wikimedia.org/T220144 (10Andrew) a:05Bstorm→03Andrew [15:45:55] (03CR) 10Bstorm: [C: 03+2] cloudstore: in stretch, the location of default nsswitch is different [puppet] - 10https://gerrit.wikimedia.org/r/506472 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [15:47:06] (03PS16) 10CRusnov: coherence report: General improvements and rack checks [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) [15:48:27] 10Operations, 10serviceops, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Kanban (Doing), and 4 others: Session storage Cassandra cluster configuration - https://phabricator.wikimedia.org/T215883 (10Eevans) 05Open→03Resolved a:03Eevans Considering this has since been... [15:49:04] !log depooling labweb1002 for easier debugging on labweb1001 [15:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:11] (03CR) 10CRusnov: "Removed asset dup check and added future date check." 
(031 comment) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) (owner: 10CRusnov) [15:52:21] RECOVERY - puppet last run on cloudstore1009 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:53:33] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Cmjohnson) [16:00:04] godog, _joe_, and mutante: I, the Bot under the Fountain, allow thee, The Deployer, to do Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190425T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:04:20] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Cmjohnson) [16:06:40] (03PS17) 10Vgutierrez: trafficserver: Provide support for inbound TLS traffic [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) [16:06:42] (03PS10) 10Vgutierrez: trafficserver: Allow disabling caching requests [puppet] - 10https://gerrit.wikimedia.org/r/506390 (https://phabricator.wikimedia.org/T221594) [16:06:44] (03PS13) 10Vgutierrez: trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [16:07:15] 10Operations, 10Traffic, 10Wikidata, 10serviceops, and 4 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (10BBlack) @WMDE-leszek Thanks for looking into it! I believe @CRoslof is who you want to coordinate with on our end, whose last statement on th... 
[16:08:49] (03PS1) 10CDanis: swift eqiad-prod: continue decom ms-be101[45] [software/swift-ring] - 10https://gerrit.wikimedia.org/r/506478 (https://phabricator.wikimedia.org/T220590) [16:09:32] (03CR) 10CDanis: [V: 03+2 C: 03+2] swift eqiad-prod: continue decom ms-be101[45] [software/swift-ring] - 10https://gerrit.wikimedia.org/r/506478 (https://phabricator.wikimedia.org/T220590) (owner: 10CDanis) [16:13:24] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission db2014,db2020, db2021, db2022, db2024, db2031 - https://phabricator.wikimedia.org/T221424 (10Papaul) [16:15:32] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Cmjohnson) [16:16:09] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (10Eevans) >>! In T220401#5134204, @akosiaris wrote: > @Eevans @Clarakosi chart has been merged and is published. The on... 
[16:17:14] (03PS1) 10Ema: cache: multiple keyspaces support for directors.frontend.vcl [puppet] - 10https://gerrit.wikimedia.org/r/506480 (https://phabricator.wikimedia.org/T219967) [16:17:56] (03PS1) 10Jbond: canary host: test method for having canary hosts [puppet] - 10https://gerrit.wikimedia.org/r/506481 (https://phabricator.wikimedia.org/T219803) [16:20:22] (03CR) 10Jbond: [C: 03+2] canary host: test method for having canary hosts [puppet] - 10https://gerrit.wikimedia.org/r/506481 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [16:25:26] (03PS1) 10Jbond: canary testing: Role back this isn't working [puppet] - 10https://gerrit.wikimedia.org/r/506483 [16:26:35] (03CR) 10Jbond: [C: 03+2] canary testing: Role back this isn't working [puppet] - 10https://gerrit.wikimedia.org/r/506483 (owner: 10Jbond) [16:27:11] 10Operations, 10decommission, 10Patch-For-Review: Decommission labnet1001, 1002 - https://phabricator.wikimedia.org/T221818 (10Andrew) [16:27:23] 10Operations, 10decommission, 10Patch-For-Review: Decommission labcontrol1001, 1002 - https://phabricator.wikimedia.org/T221817 (10Andrew) [16:27:51] !log repooled labweb1002 [16:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:05] (03PS1) 10Ema: cache: do not set backend_service [puppet] - 10https://gerrit.wikimedia.org/r/506484 (https://phabricator.wikimedia.org/T219967) [16:34:13] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Cmjohnson) [16:34:24] (03PS1) 10Andrew Bogott: cleanup puppet refs to labsdb1006/1007 [puppet] - 10https://gerrit.wikimedia.org/r/506485 (https://phabricator.wikimedia.org/T220144) [16:34:33] (03PS2) 10Ema: cache: do not set backend_service [puppet] - 10https://gerrit.wikimedia.org/r/506484 (https://phabricator.wikimedia.org/T219967) [16:34:47] (03PS2) 10Andrew Bogott: cleanup puppet refs to labsdb1006/1007 [puppet] - 
10https://gerrit.wikimedia.org/r/506485 (https://phabricator.wikimedia.org/T220144) [16:34:55] !log rolling restart of Cassandra on restbase1016-1018 to pick up Java security update [16:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:41] (03CR) 10Andrew Bogott: [C: 03+2] cleanup puppet refs to labsdb1006/1007 [puppet] - 10https://gerrit.wikimedia.org/r/506485 (https://phabricator.wikimedia.org/T220144) (owner: 10Andrew Bogott) [16:37:51] 10Operations, 10Data-Services, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): Decommission labsdb1006.eqiad.wmnet and labsdb1007.eqiad.wmnet - https://phabricator.wikimedia.org/T220144 (10Andrew) a:05Andrew→03RobH [16:38:41] (03Abandoned) 10Andrew Bogott: test patch -- do not merge [puppet] - 10https://gerrit.wikimedia.org/r/499317 (owner: 10Andrew Bogott) [16:39:25] (03PS6) 10Jcrespo: mariadb-backups: Setup dbprov eqiad servers, remove dbstore1001 backups [puppet] - 10https://gerrit.wikimedia.org/r/506180 (https://phabricator.wikimedia.org/T219399) [16:43:59] PROBLEM - puppet last run on ms-fe1005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:46:31] (03CR) 10Ema: [C: 04-1] "pcc looks kind-of ok but for some reason 'cache_local.reconfigure();' is now gone :P" [puppet] - 10https://gerrit.wikimedia.org/r/506480 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [16:49:38] (03PS1) 10CRusnov: Add README.md and LICENSE.txt [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506489 [16:50:49] (03PS31) 10Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - 10https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217) [16:50:51] (03PS18) 10Vgutierrez: trafficserver: Provide support for inbound TLS traffic [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) [16:50:53] (03PS11) 10Vgutierrez: trafficserver: Allow disabling caching requests [puppet] - 10https://gerrit.wikimedia.org/r/506390 (https://phabricator.wikimedia.org/T221594) [16:50:55] (03PS14) 10Vgutierrez: trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [16:50:57] (03CR) 10Dzahn: [C: 03+1] "> My only concern is that Tyler and I are not around this week" [puppet] - 10https://gerrit.wikimedia.org/r/505410 (https://phabricator.wikimedia.org/T221499) (owner: 10Alex Monk) [16:52:20] (03PS2) 10Andrew Bogott: Move labcontrol1001/1002 to role::spare and clean up references [puppet] - 10https://gerrit.wikimedia.org/r/506345 (https://phabricator.wikimedia.org/T221817) [16:52:22] (03PS2) 10Andrew Bogott: Move labnet1001, 1002 to role::spare, clean up other references [puppet] - 10https://gerrit.wikimedia.org/r/506346 (https://phabricator.wikimedia.org/T221818) [16:52:24] (03PS3) 10Andrew Bogott: Move labservices1001/1002 to role::spare and clean up [puppet] - 10https://gerrit.wikimedia.org/r/506428 (https://phabricator.wikimedia.org/T221857) [16:57:42] !log reorganize analytics firewall 
filters terms (description) on cr1/2-eqiad [16:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:58] !log add analytics firewall filter term schema to cr1/2-eqiad - T221690 [16:59:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:03] T221690: Allow analytics VLAN to reach schema.svc.$site.wmnet - https://phabricator.wikimedia.org/T221690 [17:00:04] cscott, arlolra, subbu, and halfak: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190425T1700). [17:00:20] no parsoid deploy today [17:02:25] PROBLEM - Disk space on dbprov1001 is CRITICAL: DISK CRITICAL - /srv/backups/dumps/ongoing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [17:02:54] ^that is me, ongoing issue but not fatal (it is a monitoring problem) [17:03:06] rog [17:03:08] 10Operations, 10Analytics-Kanban, 10EventBus, 10netops: Allow analytics VLAN to reach schema.svc.$site.wmnet - https://phabricator.wikimedia.org/T221690 (10ayounsi) 05Open→03Resolved I assumed you needed HTTPS and not HTTP based on T219552, but please reopen if it's wrong. 
[17:04:27] ACKNOWLEDGEMENT - Disk space on dbprov1001 is CRITICAL: DISK CRITICAL - /srv/backups/dumps/ongoing is not accessible: Permission denied Jcrespo known issue with private partition monitoring T219399 https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [17:04:27] ACKNOWLEDGEMENT - Disk space on dbprov1002 is CRITICAL: DISK CRITICAL - /srv/backups/dumps/ongoing is not accessible: Permission denied Jcrespo known issue with private partition monitoring T219399 https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [17:04:56] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10jcrespo) [17:06:20] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10jcrespo) [17:07:00] (03PS1) 10Herron: admin: add foks to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/506492 (https://phabricator.wikimedia.org/T220860) [17:07:34] 10Operations, 10Analytics-Kanban, 10EventBus, 10netops: Allow analytics VLAN to reach schema.svc.$site.wmnet - https://phabricator.wikimedia.org/T221690 (10Ottomata) HTTP is enough for now, thanks. If/when this gets exposed publicly we'll put it through the usual frontend nginx tls stuff there. Thank you! [17:08:08] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: access for foks to labweb (in one way or another) (or make changePassword.php work on mwmaint hosts) - https://phabricator.wikimedia.org/T220860 (10herron) Since we're approaching two weeks on this request I've proposed the above patch to move forward... 
[17:08:57] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10jcrespo) 05Open→03Resolved This is now done, both servers are in production (although not with 100% of the final load, only all logic... [17:09:39] 10Operations, 10MediaWiki-General-or-Unknown, 10serviceops, 10Core Platform Team (PHP7 (TEC4)), and 4 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Anomie) >>! In T219279#5096304, @Joe wrote: > 1. An opinion on... [17:09:58] (03CR) 10Muehlenhoff: [C: 03+1] Move labnet1001, 1002 to role::spare, clean up other references [puppet] - 10https://gerrit.wikimedia.org/r/506346 (https://phabricator.wikimedia.org/T221818) (owner: 10Andrew Bogott) [17:10:17] moritzm, have you seen 'Error: cannot create /dev/pts/ptmx device' before? [17:10:30] from stuff running in firejail [17:11:21] mmhh, let me think [17:12:08] Apr 25 17:08:58 deployment-imagescaler03 thumbor@8801[15600]: Error: cannot create /dev/pts/ptmx device [17:12:08] Apr 25 17:08:58 deployment-imagescaler03 thumbor@8801[15600]: Error: cannot establish communication with the parent, exiting... 
[17:12:08] Apr 25 17:08:58 deployment-imagescaler03 thumbor@8801[15600]: Parent pid 15600, child pid 15602 [17:12:08] Apr 25 17:08:58 deployment-imagescaler03 systemd[1]: thumbor@8801.service: Main process exited, code=exited, status=1/FAILURE [17:14:03] that's probably caused by the private-dev setting, let me check [17:14:59] RECOVERY - puppet last run on ms-fe1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:16:19] got three instances all with thumbor, all running it through firejail with private-dev, all with /dev/pts/ptmx existing - but one of them dies with this problem [17:16:30] wondering if maybe this one has a different version of thumbor or something [17:17:41] they all have 6.3.2+git20170607-1+deb9u1, even one on jessie [17:18:15] having a look at the instance, private-dev is also used without issues in the prod instances, maybe it's a red herring [17:19:01] maybe because this is the one getting traffic directly from ms-fe... hmm [17:24:42] looks like you fixed it? [17:25:05] this was just a quick test to confirm whether it's in fact related to private-dev [17:25:17] commented it out and after a restart, thumbor works [17:25:27] decommented it again and failed [17:25:31] this is really strange [17:25:50] all the debs are identical to prod (for firejail and *thumbor*) [17:27:43] (03PS2) 10Faidon Liambotis: Add README.md and LICENSE.txt [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506489 (owner: 10CRusnov) [17:28:57] but something else is odd with the VM, it tries to pull in nodejs 8 from backports instead of the nodejs 6 we run in prod (and I doubt the 3d stuff works with it) [17:29:07] Krenair: this might be far fetched but when i read "maybe one has a different version of thumbor" i got reminded of this https://phabricator.wikimedia.org/T220342 [17:29:33] actually, nevermind.. 
that would just be rendering differences anyways [17:29:47] (03PS17) 10Faidon Liambotis: coherence report: General improvements and rack checks [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) (owner: 10CRusnov) [17:29:51] (03CR) 10Faidon Liambotis: [C: 03+1] coherence report: General improvements and rack checks [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) (owner: 10CRusnov) [17:29:54] (03CR) 10Faidon Liambotis: [C: 03+1] Add README.md and LICENSE.txt [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506489 (owner: 10CRusnov) [17:30:07] I ended up restarting thumbor here because a config change made via puppet did not seem to cause a reload of the config [17:30:20] might not be the only thing missing from the puppet manifests [17:30:28] mutante: that's different, the component is around [17:30:34] but we are supposed to use nodejs10, not 8, right [17:31:04] for thumbor? [17:31:11] (03CR) 10CRusnov: [C: 03+2] Add README.md and LICENSE.txt [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506489 (owner: 10CRusnov) [17:31:37] that's what i ended up using on the parsoid test host .. 
we switched to the 10 component [17:32:15] so far only maps, turnilo and aqs are migrated AFAICT [17:32:21] PROBLEM - DPKG on ms-be2033 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.15: Connection reset by peer [17:32:25] yep, thumbor* in prod uses 6 [17:32:27] (03PS3) 10CRusnov: Add README.md and LICENSE.txt [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506489 [17:32:31] PROBLEM - configured eth on ms-be2033 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.15: Connection reset by peer [17:32:31] (03CR) 10CRusnov: [V: 03+2 C: 03+2] Add README.md and LICENSE.txt [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506489 (owner: 10CRusnov) [17:32:42] not sure about thumbor..but the situation was also that first i tried to pull in version 8 from backports and then we switched it to 10 right away.. and ok [17:32:44] (03PS18) 10CRusnov: coherence report: General improvements and rack checks [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) [17:33:31] RECOVERY - DPKG on ms-be2033 is OK: All packages OK [17:33:32] (03CR) 10Muehlenhoff: [C: 03+1] Move labcontrol1001/1002 to role::spare and clean up references [puppet] - 10https://gerrit.wikimedia.org/r/506345 (https://phabricator.wikimedia.org/T221817) (owner: 10Andrew Bogott) [17:33:41] RECOVERY - configured eth on ms-be2033 is OK: OK - interfaces up [17:33:47] (03CR) 10CRusnov: [C: 03+2] coherence report: General improvements and rack checks [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) (owner: 10CRusnov) [17:55:01] 10Operations, 10Patch-For-Review: ferm: Log dropped packets - https://phabricator.wikimedia.org/T116011 (10herron) Looking at cumin1001 I noticed that the log prefix at the end of the input chan is "fw-out-drop" and the output chain is empty with an accept policy. Is "out" indeed the direction in this case? O... 
[18:00:04] hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I ♥ Unicode. All rise for Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190425T1800). [18:00:04] stephanebisson: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:28] Hello [18:01:57] I'll SWAT (or at least try to given the issue with T221860) [18:01:58] T221860: AssertionError: false === true at thereShouldBeALinkToCreateMyUserPage on wmf-quibble PHP jobs - https://phabricator.wikimedia.org/T221860 [18:14:56] (03PS1) 10CRusnov: Fix minor error in date compare [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506498 [18:15:52] (03CR) 10CRusnov: [C: 03+2] Fix minor error in date compare [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506498 (owner: 10CRusnov) [18:16:12] (03PS2) 10Dzahn: DNS: Remove mgmt DNS for betelgeuse [dns] - 10https://gerrit.wikimedia.org/r/506443 (owner: 10Papaul) [18:16:52] (03CR) 10Dzahn: [C: 03+2] DNS: Remove mgmt DNS for betelgeuse [dns] - 10https://gerrit.wikimedia.org/r/506443 (owner: 10Papaul) [18:18:00] 10Operations, 10ops-codfw, 10decommission, 10fundraising-tech-ops, 10Patch-For-Review: decommission betelgeuse.frack.codfw.wmnet - https://phabricator.wikimedia.org/T206870 (10Dzahn) [18:19:04] (03PS2) 10Dzahn: DNS: Remove mgmt and production DNS for rigel [dns] - 10https://gerrit.wikimedia.org/r/506444 (owner: 10Papaul) [18:19:32] andrewbogott, FYI the SWAT window is open, looks like jouncebot didn't ping us [18:20:48] (03CR) 10Dzahn: [C: 03+2] "FRACK host that has been shutdown long time ago" [dns] - 10https://gerrit.wikimedia.org/r/506444 (owner: 10Papaul) [18:21:23] Krenair: I'm here but assume there will be a ping when our turn comes up [18:21:58] yeah... 
jenkins is/was having some problems so we'll see [18:22:28] 10Operations, 10ops-codfw, 10decommission, 10fundraising-tech-ops, 10Patch-For-Review: decom rigel.frack.codfw.wmnet - https://phabricator.wikimedia.org/T202535 (10Dzahn) [x] confirmed gone from puppet repo [x] remove prod DNS [x] remove mgmt DNS [18:23:36] (03PS2) 10Ottomata: Refactor eventgate-analytics to eventgate [deployment-charts] - 10https://gerrit.wikimedia.org/r/506166 (https://phabricator.wikimedia.org/T218346) [18:23:52] (03CR) 10Sbisson: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506316 (https://phabricator.wikimedia.org/T221260) (owner: 10Sbisson) [18:23:54] (03PS2) 10Dzahn: DNS: Remove mgmt DNS for db2033 [dns] - 10https://gerrit.wikimedia.org/r/506466 (owner: 10Papaul) [18:24:17] !log sbisson@deploy1001 Synchronized php-1.34.0-wmf.1/extensions/GrowthExperiments/includes/EventLogging/SpecialHomepageLogger.php: SWAT: [[gerrit:506210|EventLogging: Make namespace int, use enum for impact module state]] (duration: 00m 54s) [18:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:39] (03CR) 10Dzahn: [C: 03+2] DNS: Remove mgmt DNS for db2033 [dns] - 10https://gerrit.wikimedia.org/r/506466 (owner: 10Papaul) [18:25:12] (03CR) 10Sbisson: Cleanup old EchoCrossWikiBetaFeature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506316 (https://phabricator.wikimedia.org/T221260) (owner: 10Sbisson) [18:25:17] (03PS2) 10Sbisson: Cleanup old EchoCrossWikiBetaFeature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506316 (https://phabricator.wikimedia.org/T221260) [18:25:28] (03CR) 10Sbisson: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506316 (https://phabricator.wikimedia.org/T221260) (owner: 10Sbisson) [18:25:38] 10Operations, 10ops-codfw, 10decommission: Decommission db2033 - https://phabricator.wikimedia.org/T220070 (10Dzahn) [18:26:26] (03Merged) 10jenkins-bot: Cleanup old EchoCrossWikiBetaFeature 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/506316 (https://phabricator.wikimedia.org/T221260) (owner: 10Sbisson) [18:26:56] (03CR) 10Papaul: [C: 03+2] db2033.yaml: Remove file [puppet] - 10https://gerrit.wikimedia.org/r/506416 (https://phabricator.wikimedia.org/T220070) (owner: 10Marostegui) [18:28:19] 10Operations, 10ops-codfw, 10decommission, 10fundraising-tech-ops, 10Patch-For-Review: decommission betelgeuse.frack.codfw.wmnet - https://phabricator.wikimedia.org/T206870 (10Papaul) 05Open→03Resolved complete [18:28:44] (03CR) 10Dzahn: [C: 03+1] db2033.yaml: Remove file [puppet] - 10https://gerrit.wikimedia.org/r/506416 (https://phabricator.wikimedia.org/T220070) (owner: 10Marostegui) [18:28:54] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: Rack/Setup frbast2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T196417 (10Papaul) [18:29:00] 10Operations, 10ops-codfw, 10decommission, 10fundraising-tech-ops, 10Patch-For-Review: decom rigel.frack.codfw.wmnet - https://phabricator.wikimedia.org/T202535 (10Papaul) 05Open→03Resolved complete [18:30:07] (03PS3) 10Dzahn: db2033.yaml: Remove file [puppet] - 10https://gerrit.wikimedia.org/r/506416 (https://phabricator.wikimedia.org/T220070) (owner: 10Marostegui) [18:31:04] (03PS1) 10CRusnov: Enhancements to coherence report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506502 [18:31:06] !log sbisson@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:506316|Cleanup old EchoCrossWikiBetaFeature]] (1/2) (duration: 00m 54s) [18:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:21] 10Operations, 10ops-codfw, 10decommission: Decommission db2033 - https://phabricator.wikimedia.org/T220070 (10Papaul) 05Open→03Resolved Complete [18:31:26] andrewbogott, Krenair: You have a patch in this SWAT window. Can one of you do it or do you want me to? 
[18:31:36] I can't' [18:32:31] Isn't that OSM patch at risk of breaking things (more than they're already broken)? [18:32:48] !log sbisson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:506316|Cleanup old EchoCrossWikiBetaFeature]] (2/2) (duration: 00m 53s) [18:32:48] stephanebisson: I'd like you to if you're available. [18:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:05] (03PS2) 10CRusnov: Enhancements to coherence report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506502 [18:33:13] (03PS2) 10Smalyshev: Enable revision fetches in production [puppet] - 10https://gerrit.wikimedia.org/r/504990 (https://phabricator.wikimedia.org/T217897) [18:33:14] James_F: it fixes a known issue, and we've tested with three user accounts [18:33:21] PROBLEM - Check systemd state on ms-be1034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:33:27] James_F: I wouldn't expect it to break things that aren't already broken [18:33:33] ^ famous last words [18:33:48] andrewbogott: What about the duplicate accounts issue bd808 spent ages fixing? [18:34:02] what about them? [18:34:22] We just fixed them. That patch is in the same area. Are we sure it doesn't break that? [18:34:30] (03CR) 10jenkins-bot: Cleanup old EchoCrossWikiBetaFeature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506316 (https://phabricator.wikimedia.org/T221260) (owner: 10Sbisson) [18:34:37] James_F: This is a followup change to that fix [18:34:37] RECOVERY - Check systemd state on ms-be1034 is OK: OK - running: The system is fully operational [18:34:39] 10Operations, 10Analytics-Kanban, 10EventBus, 10netops: Allow analytics VLAN to reach schema.svc.$site.wmnet - https://phabricator.wikimedia.org/T221690 (10Ottomata) Hm, @ayounsi: `lang=shell [@stat1004:/home/otto] $ curl -Iv http://schema.svc.eqiad.wmnet:8190/repositories/ * Trying 10.2.2.43... [@stat... 
[18:34:48] bryan made things more case-sensitive, this continues that trend [18:34:59] (03PS3) 10CRusnov: Enhancements to coherence report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506502 [18:35:02] OK, as long as we're sure. [18:35:24] If you want to delay merging so you can review and test I welcome your input :) [18:35:37] 10Operations, 10Analytics-Kanban, 10EventBus, 10netops: Allow analytics VLAN to reach schema.svc.$site.wmnet - https://phabricator.wikimedia.org/T221690 (10Ottomata) Ah I think that might have been my fault. T219552 doesn't specify a port; that task description was made before service was implemented. P... [18:36:03] stephanebisson: I have the privs to merge but haven't done one in ages [18:36:08] I don't see how this would break that [18:36:13] 10Operations, 10Analytics-Kanban, 10EventBus, 10netops: Allow analytics VLAN to reach schema.svc.$site.wmnet - https://phabricator.wikimedia.org/T221690 (10Ottomata) 05Resolved→03Open [18:37:09] (03PS4) 10Faidon Liambotis: Further fixes to the coherence report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506502 (owner: 10CRusnov) [18:37:39] PROBLEM - HP RAID on ms-be1030 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.26: Connection reset by peer [18:37:41] (03PS5) 10Faidon Liambotis: Further fixes to the coherence report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506502 (owner: 10CRusnov) [18:37:46] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission db2014,db2020, db2021, db2022, db2024, db2031 - https://phabricator.wikimedia.org/T221424 (10Papaul) [18:38:05] (03CR) 10Faidon Liambotis: [C: 03+1] Further fixes to the coherence report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506502 (owner: 10CRusnov) [18:38:34] (03CR) 10CRusnov: [C: 03+2] Further fixes to the coherence report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506502 (owner: 10CRusnov) [18:41:18] It seem to be going in the 
right direction but I'm not comfortable deploying it unless someone from security can give it a thumbs up. [18:42:00] Hmm? [18:42:04] * bawolff reads backscroll [18:42:49] bawolff: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/OpenStackManager/+/506482 [18:46:06] (03PS1) 10CRusnov: Yet more Coherence fixes [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506515 [18:46:24] bd808: if around could you join the discussion of https://gerrit.wikimedia.org/r/c/mediawiki/extensions/OpenStackManager/+/506482 ? [18:46:31] 10Operations, 10ops-codfw, 10decommission: Decommission db2033 - https://phabricator.wikimedia.org/T220070 (10Reedy) [18:47:03] yeah, that's out of my area of expertise [18:49:28] bawolff: there's not anyone on the security team who has ever looked at OpenStackManager is there? [18:50:01] chasemp or reedy are probably the most likely candidates [18:51:19] I doubt it. [18:53:07] At this point it's probably Andrew, Bryan, and myself [18:54:54] * andrewbogott pushes the patch back to the next swat window [18:55:16] well [18:55:21] I already have a session working anyway [18:55:31] So I'm going to leave this alone until I next have to log into wikitech and it breaks again [18:57:44] Sorry, I didn't want to be a blocker. Someone more informed than me needs to deploy this. [18:58:32] (03CR) 10CRusnov: [C: 03+2] Yet more Coherence fixes [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506515 (owner: 10CRusnov) [18:59:47] alternatively maybe no one else will run into it before it rolls out with the train [19:00:37] (03CR) 10Dzahn: [C: 04-1] "one is left with the "iniciatives" spelling.. 
comment inline" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504502 (https://phabricator.wikimedia.org/T167375) (owner: 10Urbanecm) [19:01:31] * bd808 missed a debate on a needed change apparently [19:05:44] (03Abandoned) 10Dzahn: wikiba.se: add HSTS header with low max_age [puppet] - 10https://gerrit.wikimedia.org/r/500711 (https://phabricator.wikimedia.org/T99531) (owner: 10Dzahn) [19:07:21] jouncebot: refresh [19:07:21] I refreshed my knowledge about deployments. [19:12:47] PROBLEM - Memory correctable errors -EDAC- on thumbor1004 is CRITICAL: 9.002 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [19:17:54] 10Operations, 10Traffic, 10Wikidata, 10serviceops, and 4 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (10Dzahn) >>! In T99531#5137543, @BBlack wrote: > Re: `wikibase.org`, adding it as a non-canonical redirection to catch confusion from those that... [19:18:49] bd808: it's not too late for a +1 [19:19:06] andrewbogott: I gave it and an explanation :) [19:19:18] oh, thanks! I guess I'm not a reviewer so didn't notice [19:19:30] jouncebot: next [19:19:30] In 3 hour(s) and 40 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190425T2300) [19:19:56] If I wasn't hip deep in annual planning I would JFDI with the deploy [19:20:25] Go for it. 
[19:20:39] PROBLEM - swift-account-auditor on ms-be2027 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.63: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [19:20:47] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2027 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.63: Connection reset by peer [19:21:02] !log mobrovac@deploy1001 Started deploy [restbase/deploy@7187e0c] (dev-cluster): Bump HTML content version in docs and remove Parsoid stash fall-back [19:21:03] PROBLEM - swift-account-server on ms-be2027 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.63: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [19:21:03] PROBLEM - dhclient process on ms-be2027 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.63: Connection reset by peer [19:21:03] PROBLEM - Check systemd state on ms-be2027 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.63: Connection reset by peer [19:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:11] PROBLEM - DPKG on ms-be2027 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.63: Connection reset by peer [19:21:13] PROBLEM - swift-object-auditor on ms-be2027 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.63: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [19:21:19] PROBLEM - configured eth on ms-be2027 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.63: Connection reset by peer [19:21:23] PROBLEM - swift-object-replicator on ms-be2027 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.63: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [19:21:23] PROBLEM - swift-object-server on ms-be2027 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.63: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [19:21:27] PROBLEM - swift-account-reaper 
on ms-be2027 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.63: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [19:21:31] PROBLEM - swift-account-replicator on ms-be2027 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.63: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [19:21:45] PROBLEM - Disk space on ms-be2027 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.63: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [19:21:45] PROBLEM - MD RAID on ms-be2027 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.63: Connection reset by peer [19:21:45] PROBLEM - very high load average likely xfs on ms-be2027 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.63: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [19:21:59] PROBLEM - swift-container-updater on ms-be2027 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.63: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [19:22:13] ^^^ ??? 
[19:22:28] tldr it's fine [19:22:33] lol [19:22:35] RECOVERY - configured eth on ms-be2027 is OK: OK - interfaces up [19:22:37] RECOVERY - swift-object-replicator on ms-be2027 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator https://wikitech.wikimedia.org/wiki/Swift [19:22:37] RECOVERY - swift-object-server on ms-be2027 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server https://wikitech.wikimedia.org/wiki/Swift [19:22:41] RECOVERY - swift-account-reaper on ms-be2027 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper https://wikitech.wikimedia.org/wiki/Swift [19:22:45] RECOVERY - swift-account-replicator on ms-be2027 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator https://wikitech.wikimedia.org/wiki/Swift [19:22:57] RECOVERY - Disk space on ms-be2027 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [19:22:57] RECOVERY - very high load average likely xfs on ms-be2027 is OK: OK - load average: 29.63, 29.13, 29.48 https://wikitech.wikimedia.org/wiki/Swift [19:22:57] RECOVERY - MD RAID on ms-be2027 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [19:23:11] RECOVERY - swift-account-auditor on ms-be2027 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor https://wikitech.wikimedia.org/wiki/Swift [19:23:11] RECOVERY - swift-container-updater on ms-be2027 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater https://wikitech.wikimedia.org/wiki/Swift [19:23:19] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2027 is OK: OK ferm input default policy is set [19:23:24] we're decommissioning some swift hosts --> a lot of data is moving off of a few of them onto the other hosts --> all of them have much higher than usual I/O load --> a bunch of background stuff (like 
monitoring checks) is timing out doing I/O [19:23:37] RECOVERY - Check systemd state on ms-be2027 is OK: OK - running: The system is fully operational [19:23:37] RECOVERY - dhclient process on ms-be2027 is OK: PROCS OK: 0 processes with command name dhclient [19:23:37] RECOVERY - swift-account-server on ms-be2027 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server https://wikitech.wikimedia.org/wiki/Swift [19:23:43] RECOVERY - DPKG on ms-be2027 is OK: All packages OK [19:23:45] RECOVERY - swift-object-auditor on ms-be2027 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor https://wikitech.wikimedia.org/wiki/Swift [19:23:45] ah ok, thnx for the info cdanis [19:24:12] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@7187e0c] (dev-cluster): Bump HTML content version in docs and remove Parsoid stash fall-back (duration: 03m 10s) [19:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:15] I spent some time looking at if we could make the replication processes have less impact on the other stuff but it is not trivial [19:25:35] !log mobrovac@deploy1001 Started deploy [restbase/deploy@7187e0c]: Bump HTML content version in docs, remove Parsoid stash fall-back and start logging all sections requests - T221432 T215956 T216636 [19:25:41] (03CR) 10CDanis: "Thinking about it more, I still want your input on the ionice changes, but am just going to self-merge the change to regular 'nice'ness. 
" [puppet] - 10https://gerrit.wikimedia.org/r/506321 (owner: 10CDanis) [19:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:44] T216636: Consider deprecating section editing API in RESTBase - https://phabricator.wikimedia.org/T216636 [19:25:44] T215956: Consider stashing data-parsoid for VE - https://phabricator.wikimedia.org/T215956 [19:25:45] T221432: New swagger-ui try it out feature returns 406 for some endpoints - https://phabricator.wikimedia.org/T221432 [19:27:26] (03PS1) 10CDanis: swift-object-replicator: nice it [puppet] - 10https://gerrit.wikimedia.org/r/506540 [19:30:47] (03CR) 10Krinkle: [C: 03+1] "Thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/503349 (owner: 10Esanders) [19:41:27] (03PS1) 10Dzahn: ldap-admins: remove demon, add foks, add admin group on labweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/506542 (https://phabricator.wikimedia.org/T220860) [19:43:08] (03PS1) 10Ottomata: refine mediawiki-events - remove use of http proxy after T221690 [puppet] - 10https://gerrit.wikimedia.org/r/506543 (https://phabricator.wikimedia.org/T221690) [19:43:38] (03CR) 10jerkins-bot: [V: 04-1] refine mediawiki-events - remove use of http proxy after T221690 [puppet] - 10https://gerrit.wikimedia.org/r/506543 (https://phabricator.wikimedia.org/T221690) (owner: 10Ottomata) [19:44:23] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: access for foks to labweb (in one way or another) (or make changePassword.php work on mwmaint hosts) - https://phabricator.wikimedia.org/T220860 (10Dzahn) >>! In T220860#5138226, @herron wrote: > Happy to see another approach implemented, but at the sa... 
[19:44:29] (03PS2) 10Ottomata: refine mediawiki-events - remove use of http proxy after T221690 [puppet] - 10https://gerrit.wikimedia.org/r/506543 (https://phabricator.wikimedia.org/T221690) [19:45:39] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@7187e0c]: Bump HTML content version in docs, remove Parsoid stash fall-back and start logging all sections requests - T221432 T215956 T216636 (duration: 20m 04s) [19:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:48] T216636: Consider deprecating section editing API in RESTBase - https://phabricator.wikimedia.org/T216636 [19:45:48] T215956: Consider stashing data-parsoid for VE - https://phabricator.wikimedia.org/T215956 [19:45:49] T221432: New swagger-ui try it out feature returns 406 for some endpoints - https://phabricator.wikimedia.org/T221432 [19:45:52] (03PS2) 10Dzahn: ldap-admins: remove demon, add foks, add admin group on labweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/506542 (https://phabricator.wikimedia.org/T220860) [19:49:00] (03CR) 10Ottomata: [C: 03+2] refine mediawiki-events - remove use of http proxy after T221690 [puppet] - 10https://gerrit.wikimedia.org/r/506543 (https://phabricator.wikimedia.org/T221690) (owner: 10Ottomata) [19:50:03] mutante, is ldap-admins going to give foks the permissions he needs? [19:50:24] Krenair: so far i assumed he doesn't need to run anything as root [19:50:36] just be able to login on labweb [19:50:39] it will only let him onto the host and let him do some irrelevant stuff [19:50:51] changePassword.php needs to be run with sudo ? [19:51:29] Not directly, but does mwscript? [19:51:33] yes, with sudo [19:51:36] mwscript requires sudo [19:51:38] not to root, but still [19:52:02] ok, in that case i would add a sudo privs line to ldap-admins [19:52:06] no [19:52:18] sudo as which user? 
[19:52:26] ldap-admins should probably be a separate group [19:52:56] well..the point was that i would disagree with that part [19:53:10] a person who can do one thing should also be able to do the other thing [19:53:15] if it's part of the same workflow [19:53:17] think it's www-data? [19:53:48] but also.. new groups don't hurt me [19:53:53] 10Operations, 10RESTBase, 10RESTBase-API, 10serviceops, and 4 others: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218 (10mobrovac) [19:54:14] except it makes no sense if we end up always doing access requests for both groups together [19:54:42] afaik the only reason ldap-admins is there is historical stuff from fenari? [19:54:58] yes, it's www-data but also more, i guess [19:54:59] ALL = (www-data,apache,mwdeploy,l10nupdate) NOPASSWD: ALL', [19:55:29] yes, it's an old group but nevertheless i think it matches the request [19:55:37] people who can change passwords [19:55:40] and users [19:56:21] doesn't it make more sense than doing "add to mw deployment group" to solve that? [19:56:40] deploying and user maintenance seem different things to me [19:57:04] they are indeed separate too [19:57:17] and doesn't the workflow that foks needs involve both LDAP change and running that script [19:57:24] speaking of that sudo line, I wonder if www-data,apache (still?) makes sense. [19:57:27] i mean.. would you have one without the other [19:57:57] Krenair: good question [19:58:39] AFAIK there is no requirement for foks to directly modify LDAP itself, only via wikitech/MW? [19:59:08] ok, i guess this is the important point.. does the workflow need both things or not [19:59:28] i wanted to avoid that we have to ping 2 different people for one thing [19:59:48] maybe that could use more input from secteam [20:00:09] if one person needs to do both things.. one group.. otherwise 2 groups? 
[20:03:19] (03CR) 10Dzahn: "per IRC discussion, will need at least an extra sudo privileges line to run mwscript.. or we go back to "entirely new admin group" but tha" [puppet] - 10https://gerrit.wikimedia.org/r/506542 (https://phabricator.wikimedia.org/T220860) (owner: 10Dzahn) [20:04:42] (03CR) 10Alex Monk: "changePassword is supposed to do the LDAP part, as far as I know ldap-admins is intended for people directly modifying LDAP" [puppet] - 10https://gerrit.wikimedia.org/r/506542 (https://phabricator.wikimedia.org/T220860) (owner: 10Dzahn) [20:07:10] i guess we can just do group "password-changers"..bbiaw [20:07:39] PROBLEM - HP RAID on ms-be1034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.32.223: Connection reset by peer [20:32:33] PROBLEM - Check systemd state on ms-be2032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:33:35] usual debmonitor blip [20:33:46] ah [20:33:49] was just about to check on it [20:34:06] maybe it shouldn't alert until it's been a few times [20:34:32] that would be nice [20:41:09] PROBLEM - HP RAID on ms-be1030 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.26: Connection reset by peer [20:43:35] PROBLEM - swift-object-replicator on ms-be1030 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.26: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [20:44:47] RECOVERY - swift-object-replicator on ms-be1030 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator https://wikitech.wikimedia.org/wiki/Swift [20:45:07] PROBLEM - ping-offload grafana alert on icinga1001 is CRITICAL: CRITICAL: Ping offload ( https://grafana.wikimedia.org/d/000000513/ping-offload ) is alerting: target IP missing on hosts loopback. 
https://wikitech.wikimedia.org/wiki/Ping_offload%23InAddrErrors_alert [20:46:25] RECOVERY - ping-offload grafana alert on icinga1001 is OK: OK: Ping offload ( https://grafana.wikimedia.org/d/000000513/ping-offload ) is not alerting. https://wikitech.wikimedia.org/wiki/Ping_offload%23InAddrErrors_alert [20:48:29] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 39.71 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:49:51] hello, anyone know anything about wikipedia.org TLS impl? I do 2 get requests, potentially in the same session via keep-alive, potentially not, but able for session re-use, and it looks to do each request in separate TLS keys. Full handshake for both requests rather than keepalive or session id reuse. [20:50:15] hello jrwren [20:51:07] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 81.01 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:52:53] the expert for these things is BBlack [20:52:59] which could probably answer by heart [20:53:50] this is probably going to varnish [20:54:26] not sure if they have a shared session cache [20:55:41] That makes sense, just surprising because I've only seen it the other way in my tests of very few other sites until now. [20:56:27] do note that there are many backend servers [20:56:34] that may be handling your request [20:57:22] understood. [20:57:43] I expect there to be a shared session [20:57:51] otoh, perhaps it was disabled on purpose [20:58:00] to avoid past-traffic analysis [20:58:11] Another question, if i may, is about MTU and TLS Record size, and please if this makes no sense, please say so, I only realized the connection minutes ago. [20:59:12] I notice every TLS Record fits into 2 TCP packets, the first at full MTU and a second packet with the remaining part of the TLS Record. Is this by design? 
Is this some MTU path thing where I'm not in an optimal path? other? [21:00:04] very few people around here are likely to know the answers to these questions [21:00:30] that's probably because the cipher list won't fit into a single packet [21:00:50] hmm, wait [21:00:57] you're talking about TLS records [21:01:04] yeah, not the handshake. I'm talking about app data records. [21:01:38] the server handshake is multiple TCP packets and that is the norm on other servers too. Those are large. [21:01:56] I believe there is some SSL session caching [21:01:56] modules/profile/templates/tlsproxy/nginx.conf.erb: ssl_session_cache shared:SSL:1024m; [21:02:06] that's for nginx [21:02:11] but as Platonides says, LVS will be in front of that [21:02:20] but isn't a varnish in front of them? [21:02:23] no [21:02:39] it goes LVS -> nginx -> varnish [21:02:52] oh, it's the other way [21:02:53] nginx doing the TLS termination [21:02:57] varnish cannot [21:03:29] forgive me. after scrolling the pcap I'm looking at, a bit more, I see after some time the behavior I describe changed. It must be that something detected a better MTU and set a TLS Record size limit to match. [21:05:01] there's also this: modules/profile/templates/tlsproxy/nginx.conf.erb: ssl_session_tickets off; [21:05:13] it looks like we do use dynamic sizing of TLS records in nginx -- modules/profile/templates/tlsproxy/nginx.conf.erb: ssl_dyn_rec_enable on; # cf patch default: off [21:05:20] err... rather, at least it stopped flushing TCP packets with each record and started filling tcp packets. [21:05:21] commented 'Disable RFC5077 tickets (may revisit later when client support is better)' [21:05:40] I believe it's the same code as described at https://blog.cloudflare.com/optimizing-tls-over-tcp-to-reduce-latency/ [21:06:11] Ok, so what I observed is maybe a bit of an anomaly, but ramped up to something correct. [21:06:14] Thanks for taking the time. 
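The "one TLS record spread over two TCP packets" observation in this exchange can be sketched with a little arithmetic. This is only a simplified model under stated assumptions, not a description of what nginx or the cf patch actually does: the function names are hypothetical, the MSS of 1460 bytes and the 29-byte per-record overhead are assumed illustrative values, and it assumes each record is flushed to TCP on its own, with no coalescing.

```python
from typing import List

# Assumed illustrative values, not taken from the log or from any config.
ASSUMED_MSS = 1460              # TCP payload bytes per segment on a 1500-byte MTU path
ASSUMED_RECORD_OVERHEAD = 29    # TLS record header plus AEAD nonce/tag (varies by version and cipher)


def record_wire_size(payload_len: int, overhead: int = ASSUMED_RECORD_OVERHEAD) -> int:
    """Bytes one TLS record occupies on the wire for a given plaintext length."""
    if payload_len <= 0:
        raise ValueError("payload_len must be positive")
    return payload_len + overhead


def split_into_segments(wire_len: int, mss: int = ASSUMED_MSS) -> List[int]:
    """TCP segment payload sizes if a record of wire_len bytes is flushed alone."""
    if wire_len <= 0 or mss <= 0:
        raise ValueError("wire_len and mss must be positive")
    full, rest = divmod(wire_len, mss)
    return [mss] * full + ([rest] if rest else [])


if __name__ == "__main__":
    # A record slightly larger than one MSS: one full segment plus a short tail.
    print(split_into_segments(record_wire_size(2000)))
    # A record sized to fit inside one MSS: a single segment, no tail.
    print(split_into_segments(record_wire_size(1400)))
```

In this model, a record only a little larger than the MSS always yields one full segment followed by a short one, which matches the pattern described above. A record small enough to fit in one segment avoids the split, and that is the general idea behind sizing records dynamically.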
[21:06:16] now that RFC seems to be about TLS session resumption which is probably what you were looking for [21:06:38] A bit of both. [21:06:44] jrwren: if you need to dig into more-specific details, it might be better async via phabricator tasks [21:06:56] I can answer some of this, but it's kind of a hectic day/week and there's a lot else going on [21:07:19] Thanks, I think this high level BG is enough for me. [21:09:13] RECOVERY - Check systemd state on ms-be2032 is OK: OK - running: The system is fully operational [21:12:01] !log T221516 running mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=metawiki --logwiki=metawiki 'FoldDownPro' 'MichaelOBFDP' [21:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:06] T221516: Please unblock stuck global rename - https://phabricator.wikimedia.org/T221516 [21:13:59] PROBLEM - Memory correctable errors -EDAC- on thumbor1004 is CRITICAL: 10 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [21:15:53] PROBLEM - BGP status on cr1-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect, AS6939/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:17:08] (03CR) 10Andrew Bogott: [C: 03+2] Move labservices1001/1002 to role::spare and clean up [puppet] - 10https://gerrit.wikimedia.org/r/506428 (https://phabricator.wikimedia.org/T221857) (owner: 10Andrew Bogott) [21:17:19] (03PS3) 10Andrew Bogott: Move labcontrol1001/1002 to role::spare and clean up references [puppet] - 10https://gerrit.wikimedia.org/r/506345 (https://phabricator.wikimedia.org/T221817) [21:17:28] (03PS3) 10Andrew Bogott: Move labnet1001, 1002 to role::spare, clean up other references [puppet] - 10https://gerrit.wikimedia.org/r/506346 (https://phabricator.wikimedia.org/T221818) [21:17:38] (03PS4) 10Andrew Bogott: Move labservices1001/1002 to role::spare and clean up [puppet] - 
10https://gerrit.wikimedia.org/r/506428 (https://phabricator.wikimedia.org/T221857) [21:20:21] yay! [21:21:04] (03CR) 10Andrew Bogott: [C: 03+2] Move labcontrol1001/1002 to role::spare and clean up references [puppet] - 10https://gerrit.wikimedia.org/r/506345 (https://phabricator.wikimedia.org/T221817) (owner: 10Andrew Bogott) [21:21:15] (03CR) 10Andrew Bogott: [C: 03+2] Move labnet1001, 1002 to role::spare, clean up other references [puppet] - 10https://gerrit.wikimedia.org/r/506346 (https://phabricator.wikimedia.org/T221818) (owner: 10Andrew Bogott) [21:23:25] (03PS1) 10Dzahn: raid: add Icinga notes_urls [puppet] - 10https://gerrit.wikimedia.org/r/506548 (https://phabricator.wikimedia.org/T197873) [21:24:02] (03CR) 10jerkins-bot: [V: 04-1] raid: add Icinga notes_urls [puppet] - 10https://gerrit.wikimedia.org/r/506548 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [21:24:33] 10Operations, 10PHP 7.2 support, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Krinkle) For sessions. >>! From T211488#5102911: > `lang=diff, > -session.save_path = > +session.s... 
[21:25:56] (03PS2) 10Dzahn: raid: add Icinga notes_urls [puppet] - 10https://gerrit.wikimedia.org/r/506548 (https://phabricator.wikimedia.org/T197873) [21:26:46] !log revoking M5 grants as per https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/506428/4/modules/role/templates/mariadb/grants/production-m5.sql.erb and https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/506345/3/modules/role/templates/mariadb/grants/production-m5.sql.erb [21:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:49] M5: Where should the Wikimedia usernames appear - https://phabricator.wikimedia.org/M5 [21:27:54] 10Operations, 10PHP 7.2 support, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Krinkle) [21:29:23] 10Operations, 10Performance-Team, 10PHP 7.2 support, 10User-jijiki: Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Krinkle) 05Open→03Resolved a:03Krinkle [21:29:35] 10Operations, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Kanban (Doing), 10HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Krinkle) [21:29:44] 10Operations, 10Performance-Team, 10PHP 7.2 support, 10User-jijiki: Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Krinkle) The mysql settings have been checked by aaron, jcrespo and joe. 
[21:30:12] 10Operations, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Kanban (Doing), 10HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Krinkle) [21:31:01] (03PS1) 10Dzahn: lvs: add runbook for check_rp_filter_disabled [puppet] - 10https://gerrit.wikimedia.org/r/506549 [21:31:18] 10Operations, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Kanban (Doing), 10HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Krinkle) [21:31:32] !log stopping nova services on labnet1001/1002 [21:31:47] RECOVERY - BGP status on cr1-eqsin is OK: BGP OK - up: 259, down: 4, shutdown: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:31:48] 10Operations, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Kanban (Doing), 10HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Krinkle) Removing item for checking php70 and php71 issues. Any of those affecting php72, I've co-tagged with #php_7... 
[21:33:35] (03PS1) 10Dzahn: base::firewall: add runbooks for check_ferm and check_conntrack [puppet] - 10https://gerrit.wikimedia.org/r/506550 [21:36:50] (03PS1) 10Dzahn: installserver: add icinga runbook for tftpd down [puppet] - 10https://gerrit.wikimedia.org/r/506551 [21:38:58] 10Operations, 10decommission, 10Patch-For-Review: Decommission labnet1001, 1002 - https://phabricator.wikimedia.org/T221818 (10Andrew) [21:44:37] (03PS1) 10Dzahn: kafkatee: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/506557 (https://phabricator.wikimedia.org/T194724) [21:49:30] PROBLEM - HP RAID on ms-be2032 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.14: Connection reset by peer [21:49:38] (03PS1) 10Dzahn: mariadb-eventlogging-repl: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/506558 (https://phabricator.wikimedia.org/T194724) [21:50:02] !log icinga-downtime -h ms-be2032 -r swift-rebalancing -d 86400 [21:55:31] (03PS1) 10Dzahn: memcached: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/506559 (https://phabricator.wikimedia.org/T194724) [21:55:59] (03CR) 10jerkins-bot: [V: 04-1] memcached: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/506559 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [21:57:11] (03PS2) 10Dzahn: memcached: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/506559 (https://phabricator.wikimedia.org/T194724) [21:58:08] PROBLEM - swift-container-server on ms-be2024 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.60: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [21:58:34] PROBLEM - MD RAID on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer [21:58:34] PROBLEM - puppet last run on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection 
reset by peer [21:59:03] machine overloaded? [21:59:05] !log icinga-downtime -h ms-be2019 -r swift-rebalancing -d 86400 [21:59:08] RECOVERY - swift-container-server on ms-be2024 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server https://wikitech.wikimedia.org/wiki/Swift [21:59:11] ah [21:59:19] the swift machines are just going to do this for a while [21:59:22] :\ [21:59:22] chaomodus: yea, it's the same thing as the other swift machines [21:59:24] i figured [21:59:25] I am not sure what to do about it [21:59:34] RECOVERY - MD RAID on ms-be2019 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [21:59:35] downtime them all? :) [21:59:38] I kicked off a new round of lowered-weight-on-machines-to-be-decommed this morning [22:00:20] my first thought was to ionice the rsync / object-replicator processes, but that won't do anything because we use the 'deadline' I/O scheduler everywhere, which cares not for ionice [22:01:10] if you know the range ..could do something like "for host in ms-be20$(seq .. ; do icinga-downtime .. [22:01:21] it will be all the mse-be machines [22:01:23] ms-be* [22:01:30] in both eqiad and codfw (machines being decommed in both) [22:01:43] I am kind of loathe to downtime everything on those hosts though -- we do care that they are up and working [22:01:49] hmm.. then i tend to say let's just leave it as it is [22:01:59] yes, agree [22:03:38] RECOVERY - puppet last run on ms-be2019 is OK: OK: Puppet is currently enabled, last run 18 minutes ago with 0 failures [22:05:10] PROBLEM - Check systemd state on ms-be2021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
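(editor's note) The "for host in ms-be20$(seq .. ; do icinga-downtime .." idea above was left unfinished, and the channel decided not to downtime everything. For reference only, a minimal sketch of what generating those commands could look like. The `-h`, `-r` and `-d` flags are copied from the `!log` lines in this log; the host number range is a made-up example, and the script only prints the commands rather than running them.

```python
import shlex


def downtime_commands(hosts, reason, seconds):
    """Build one icinga-downtime command line per host (nothing is executed)."""
    return [
        shlex.join(["icinga-downtime", "-h", host, "-r", reason, "-d", str(seconds)])
        for host in hosts
    ]


# Hypothetical range: the log never states which ms-be numbers are affected.
hosts = [f"ms-be20{n}" for n in range(17, 20)]

for command in downtime_commands(hosts, "swift-rebalancing", 86400):
    print(command)  # dry run: review before pasting anywhere
```

Printing instead of executing keeps this a dry run, which fits the concern raised in the channel that these hosts should still be monitored.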
[22:05:27] 10Operations, 10media-storage: swift backend decomms / rebalances are noisy - https://phabricator.wikimedia.org/T221904 (10CDanis) [22:18:38] (03PS1) 10Andrew Bogott: Revert "Move labnet1001, 1002 to role::spare, clean up other references" [puppet] - 10https://gerrit.wikimedia.org/r/506563 [22:19:20] (03PS2) 10Andrew Bogott: Revert "Move labnet1001, 1002 to role::spare, clean up other references" [puppet] - 10https://gerrit.wikimedia.org/r/506563 [22:20:31] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Move labnet1001, 1002 to role::spare, clean up other references" [puppet] - 10https://gerrit.wikimedia.org/r/506563 (owner: 10Andrew Bogott) [22:22:41] (03PS1) 10Andrew Bogott: Move labnet1001, 1002 to role::spare, clean up other references [puppet] - 10https://gerrit.wikimedia.org/r/506564 [22:26:40] (03PS1) 10Andrew Bogott: Move labservices1001/1002 to role::spare and clean up [puppet] - 10https://gerrit.wikimedia.org/r/506565 [22:27:09] (03PS2) 10Andrew Bogott: Move labservices1001/1002 to role::spare and clean up [puppet] - 10https://gerrit.wikimedia.org/r/506565 [22:28:14] (03CR) 10Andrew Bogott: [C: 03+2] Move labservices1001/1002 to role::spare and clean up [puppet] - 10https://gerrit.wikimedia.org/r/506565 (owner: 10Andrew Bogott) [22:29:51] (03PS1) 10Andrew Bogott: Move labservices1001/1002 to role::spare and clean up [puppet] - 10https://gerrit.wikimedia.org/r/506566 [22:32:14] RECOVERY - Check systemd state on ms-be2021 is OK: OK - running: The system is fully operational [22:32:24] PROBLEM - Check systemd state on ms-be2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[22:32:37] noise monster [22:32:58] lol [22:37:09] (03PS2) 10Andrew Bogott: Move labnet1001, 1002 to role::spare, clean up other references [puppet] - 10https://gerrit.wikimedia.org/r/506564 [22:38:16] (03CR) 10Andrew Bogott: [C: 03+2] Move labnet1001, 1002 to role::spare, clean up other references [puppet] - 10https://gerrit.wikimedia.org/r/506564 (owner: 10Andrew Bogott) [22:48:05] 10Operations, 10decommission, 10Patch-For-Review: Decommission labcontrol1001, 1002 - https://phabricator.wikimedia.org/T221817 (10Andrew) [22:49:55] 10Operations, 10decommission, 10Patch-For-Review: Decommission labcontrol1001, 1002 - https://phabricator.wikimedia.org/T221817 (10Andrew) a:05Andrew→03RobH @robh, I'm supposed to assign decom hosts to you at this point, right? [22:50:08] 10Operations, 10decommission, 10Patch-For-Review: Decommission labnet1001, 1002 - https://phabricator.wikimedia.org/T221818 (10Andrew) a:05Andrew→03RobH [22:50:22] https://phabricator.wikimedia.org/T221817 yes [22:50:27] andrewbogott: https://phabricator.wikimedia.org/T221817 yes =] [22:50:51] basically you don't wanna disable puppet and power anything down unless you can also IMMEDIATELY disable the switch port =] [22:51:02] ok, that's what I thought. Thanks [22:51:10] four more trusty hosts on the chopping block! [22:51:24] oh, i'll kill them right away [22:53:02] 10Operations, 10decommission: Decommission labcontrol1001, 1002 - https://phabricator.wikimedia.org/T221817 (10RobH) [22:54:32] 10Operations, 10decommission: Decommission labcontrol1001, 1002 - https://phabricator.wikimedia.org/T221817 (10RobH) labcontrol1001:asw2-c-eqiad:ge-5/0/32 labcontrol1002:asw2-c-eqiad:ge-7/0/26 [22:55:08] 10Operations, 10decommission: Decommission labcontrol1001, 1002 - https://phabricator.wikimedia.org/T221817 (10RobH) >>! In T221817#5139201, @Andrew wrote: > @robh, I'm supposed to assign decom hosts to you at this point, right? Yep! Basically once we get to the point of disabling puppet and powering down a...
[22:56:36] (03PS2) 10Dzahn: installserver: add icinga runbook for tftpd down [puppet] - 10https://gerrit.wikimedia.org/r/506551 [22:56:45] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [22:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:49] (03PS3) 10Dzahn: installserver: add icinga runbook for tftpd down [puppet] - 10https://gerrit.wikimedia.org/r/506551 (https://phabricator.wikimedia.org/T197873) [22:56:51] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [22:56:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:57] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [22:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:01] 10Operations, 10decommission: Decommission labcontrol1001, 1002 - https://phabricator.wikimedia.org/T221817 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `labcontrol1001.wikimedia.org` - labcontrol1001.wikimedia.org - Removed from Puppet master and PuppetDB... [22:57:03] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [22:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:06] 10Operations, 10decommission: Decommission labcontrol1001, 1002 - https://phabricator.wikimedia.org/T221817 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `labcontrol1002.wikimedia.org` - labcontrol1002.wikimedia.org - Removed from Puppet master and PuppetDB... 
[22:57:29] 10Operations, 10ops-eqiad: Degraded RAID on labcontrol1002 - https://phabricator.wikimedia.org/T221909 (10ops-monitoring-bot) [22:58:45] 10Operations, 10ops-eqiad: Degraded RAID on labcontrol1001 - https://phabricator.wikimedia.org/T221910 (10ops-monitoring-bot) [22:59:02] 10Operations, 10ops-eqiad: Degraded RAID on labcontrol1001 - https://phabricator.wikimedia.org/T221911 (10ops-monitoring-bot) [22:59:26] 10Operations, 10ops-eqiad: Degraded RAID on labcontrol1002 - https://phabricator.wikimedia.org/T221912 (10ops-monitoring-bot) [22:59:30] 10Operations, 10ops-eqiad, 10decommission: Decommission labcontrol1001, 1002 - https://phabricator.wikimedia.org/T221817 (10RobH) p:05Triage→03Normal [23:00:04] MaxSem, RoanKattouw, and Niharika: (Dis)respected human, time to deploy Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190425T2300). Please do the needful. [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:01:10] (03PS1) 10RobH: labcontrol100[12] production dns decom [dns] - 10https://gerrit.wikimedia.org/r/506567 (https://phabricator.wikimedia.org/T221817) [23:01:11] (03CR) 10Dzahn: "https://wikitech.wikimedia.org/wiki/Monitoring/atftpd" [puppet] - 10https://gerrit.wikimedia.org/r/506551 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [23:01:35] (03CR) 10Dzahn: [C: 03+2] installserver: add icinga runbook for tftpd down [puppet] - 10https://gerrit.wikimedia.org/r/506551 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [23:01:43] (03PS4) 10Dzahn: installserver: add icinga runbook for tftpd down [puppet] - 10https://gerrit.wikimedia.org/r/506551 (https://phabricator.wikimedia.org/T197873) [23:01:57] (03CR) 10RobH: [C: 03+2] labcontrol100[12] production dns decom [dns] - 10https://gerrit.wikimedia.org/r/506567 (https://phabricator.wikimedia.org/T221817) (owner: 10RobH) [23:04:10] (03PS1) 10RobH: decom labcontrol100[12] [puppet] - 10https://gerrit.wikimedia.org/r/506569 
(https://phabricator.wikimedia.org/T221817) [23:04:58] robh: i made a change to add Icinga URLs to the RAID checks. we have like 5 different controllers (for hwraid). i used 2 different links among them, one for megacli and one for the dcops troubleshooting page. feel like reviewing? https://gerrit.wikimedia.org/r/c/operations/puppet/+/506548 [23:05:24] it's just about the additional info URLs not what the checks do [23:06:03] (03CR) 10RobH: [C: 03+2] decom labcontrol100[12] [puppet] - 10https://gerrit.wikimedia.org/r/506569 (https://phabricator.wikimedia.org/T221817) (owner: 10RobH) [23:06:12] (03PS2) 10RobH: decom labcontrol100[12] [puppet] - 10https://gerrit.wikimedia.org/r/506569 (https://phabricator.wikimedia.org/T221817) [23:06:37] mutante: yeah can check it out, im not sure if having more than one link is ideal or not, but also not sure im up to writing more wikitech docs to combine ;D [23:06:46] today that is, heh up to combining them today [23:07:12] yea, this was best effort, i searched for the types and the megacli specific ones had their own page already [23:07:42] i can also amend it to use the same URL for all and then just link there "also see [[MegaCLI]] [23:07:50] anyways.. no rush [23:09:45] 10Operations, 10Analytics, 10EventBus, 10Core Platform Team (Modern Event Platform (TEC2)), and 3 others: Possibly expand Kafka main-{eqiad,codfw} clusters in Q4 2019. - https://phabricator.wikimedia.org/T217359 (10mobrovac) [23:12:12] PROBLEM - Check systemd state on ms-be2039 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:16:18] 10Operations, 10DC-Ops, 10SRE-Access-Requests, 10Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10Dzahn) Hi again @wiki_willy, for the remaining step "access to pwstore", so that you can read the shared passwords, we will have to go through the steps described on the [[... 
[23:16:58] (03PS1) 10CRusnov: puppetdb_microservice: Add acceptable facts [puppet] - 10https://gerrit.wikimedia.org/r/506570 [23:21:55] 10Operations, 10DC-Ops, 10SRE-Access-Requests, 10Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10Dzahn) [23:24:11] RECOVERY - Check systemd state on ms-be2017 is OK: OK - running: The system is fully operational [23:26:13] 10Operations, 10ops-eqiad, 10decommission: Decommission labcontrol1001, 1002 - https://phabricator.wikimedia.org/T221817 (10RobH) [23:27:01] 10Operations, 10ops-eqiad, 10decommission: Decommission labcontrol1001, 1002 - https://phabricator.wikimedia.org/T221817 (10RobH) a:05RobH→03Cmjohnson [23:30:52] (03PS1) 10Dzahn: icinga/nagios_common: add Willy Pao to group misleadingly called 'sms' [puppet] - 10https://gerrit.wikimedia.org/r/506571 (https://phabricator.wikimedia.org/T221142) [23:31:10] 10Operations, 10decommission: Decommission labnet1001, 1002 - https://phabricator.wikimedia.org/T221818 (10RobH) [23:32:03] (03PS2) 10Dzahn: icinga/nagios_common: add Willy Pao to group misleadingly called 'sms' [puppet] - 10https://gerrit.wikimedia.org/r/506571 (https://phabricator.wikimedia.org/T221142) [23:33:14] 10Operations, 10decommission: Decommission labnet1001, 1002 - https://phabricator.wikimedia.org/T221818 (10RobH) asw2-b-eqiad: robh@asw2-b-eqiad> show interfaces descriptions | grep labnet xe-2/0/22 up up labnet1001:eth1 xe-2/0/24 up up labnet1001:eth0 xe-4/0/44 up up labnet1... 
[23:33:47] (03CR) 10Dzahn: [C: 03+2] icinga/nagios_common: add Willy Pao to group misleadingly called 'sms' [puppet] - 10https://gerrit.wikimedia.org/r/506571 (https://phabricator.wikimedia.org/T221142) (owner: 10Dzahn) [23:41:05] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [23:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:11] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [23:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:17] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [23:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:23] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [23:41:24] 10Operations, 10decommission: Decommission labnet1001, 1002 - https://phabricator.wikimedia.org/T221818 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `labnet1001.eqiad.wmnet` - labnet1001.eqiad.wmnet - Removed from Puppet master and PuppetDB - Downtimed host... [23:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:27] 10Operations, 10decommission: Decommission labnet1001, 1002 - https://phabricator.wikimedia.org/T221818 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `labnet1002.eqiad.wmnet` - labnet1002.eqiad.wmnet - Removed from Puppet master and PuppetDB - Downtimed host... [23:42:58] (03PS1) 10RobH: decom labnet100[12] production dns [dns] - 10https://gerrit.wikimedia.org/r/506574 (https://phabricator.wikimedia.org/T221818) [23:43:23] PROBLEM - Check systemd state on ms-be1020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[23:46:25] (03PS1) 10RobH: labnet100[12] decom [puppet] - 10https://gerrit.wikimedia.org/r/506575 (https://phabricator.wikimedia.org/T221818) [23:46:43] (03CR) 10RobH: [C: 03+2] labnet100[12] decom [puppet] - 10https://gerrit.wikimedia.org/r/506575 (https://phabricator.wikimedia.org/T221818) (owner: 10RobH) [23:47:09] RECOVERY - Check systemd state on ms-be1020 is OK: OK - running: The system is fully operational [23:47:57] PROBLEM - Check systemd state on ms-be2034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:48:12] (03PS1) 10Dzahn: icinga: give global host/service/command privileges to Willy Pao [puppet] - 10https://gerrit.wikimedia.org/r/506576 (https://phabricator.wikimedia.org/T221142) [23:48:46] 10Operations, 10ops-eqiad, 10decommission: Decommission labnet1001, 1002 - https://phabricator.wikimedia.org/T221818 (10RobH) p:05Triage→03Normal a:05RobH→03Cmjohnson [23:48:56] 10Operations, 10ops-eqiad, 10decommission: Decommission labnet1001 & labnet1002 - https://phabricator.wikimedia.org/T221818 (10RobH) [23:49:32] 10Operations, 10ops-eqiad, 10decommission: Decommission labcontrol1001 & labcontrol1002 - https://phabricator.wikimedia.org/T221817 (10RobH) [23:49:52] (03CR) 10Dzahn: [C: 03+2] icinga: give global host/service/command privileges to Willy Pao [puppet] - 10https://gerrit.wikimedia.org/r/506576 (https://phabricator.wikimedia.org/T221142) (owner: 10Dzahn) [23:50:00] (03PS2) 10Dzahn: icinga: give global host/service/command privileges to Willy Pao [puppet] - 10https://gerrit.wikimedia.org/r/506576 (https://phabricator.wikimedia.org/T221142) [23:52:04] 10Operations, 10ops-eqiad, 10decommission: Decommission labcontrol1001 & labcontrol1002 - https://phabricator.wikimedia.org/T221817 (10RobH) [23:52:39] 10Operations, 10ops-codfw, 10decommission, 10cloud-services-team (Kanban): decommmision: labtestweb2001.wikimedia.org - https://phabricator.wikimedia.org/T218024 (10RobH) [23:53:09] 10Operations, 10ops-codfw, 
10decommission: decommission: labtestservices2001.wikimedia.org - https://phabricator.wikimedia.org/T218022 (10RobH) [23:53:35] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission db2014,db2020, db2021, db2022, db2024, db2031 - https://phabricator.wikimedia.org/T221424 (10RobH) [23:54:07] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission osm-db200[12] and osm-web200[1234] - https://phabricator.wikimedia.org/T187445 (10RobH) [23:54:14] (03CR) 10Papaul: [C: 03+1] raid: add Icinga notes_urls [puppet] - 10https://gerrit.wikimedia.org/r/506548 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [23:54:30] 10Operations, 10ops-codfw, 10Traffic, 10decommission: Decommission acamar and achernar - https://phabricator.wikimedia.org/T198286 (10RobH) [23:54:42] 10Operations, 10ops-codfw, 10Reading-Infrastructure-Team-Backlog, 10decommission: Decommission maps-test cluster - https://phabricator.wikimedia.org/T202898 (10RobH) [23:56:11] 10Operations, 10DC-Ops, 10SRE-Access-Requests, 10Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10Dzahn) [23:57:27] (03PS1) 10Dzahn: icinga: remove Katie Horn from host/service/command privileges [puppet] - 10https://gerrit.wikimedia.org/r/506578