[00:00:04] twentyafterfour: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180920T0000). [00:00:38] (03PS1) 10Ayounsi: Icinga, assign bfd check to routers [puppet] - 10https://gerrit.wikimedia.org/r/461498 (https://phabricator.wikimedia.org/T83992) [00:02:08] (03PS2) 10Dzahn: Revert "profile::mediawiki::maintenance: depend on mediawiki config, not hiera" [puppet] - 10https://gerrit.wikimedia.org/r/461496 [00:02:28] (03CR) 10jerkins-bot: [V: 04-1] Revert "profile::mediawiki::maintenance: depend on mediawiki config, not hiera" [puppet] - 10https://gerrit.wikimedia.org/r/461496 (owner: 10Dzahn) [00:02:31] (03PS3) 10Dzahn: Revert "profile::mediawiki::maintenance: depend on mediawiki config, not hiera" [puppet] - 10https://gerrit.wikimedia.org/r/461496 [00:02:47] (03CR) 10jerkins-bot: [V: 04-1] Revert "profile::mediawiki::maintenance: depend on mediawiki config, not hiera" [puppet] - 10https://gerrit.wikimedia.org/r/461496 (owner: 10Dzahn) [00:05:47] (03PS4) 10Dzahn: Revert "profile::mediawiki::maintenance: depend on mediawiki config, not hiera" [puppet] - 10https://gerrit.wikimedia.org/r/461496 [00:07:55] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10Dzahn) [00:08:00] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10Dzahn) 05Open>03stalled for now blocked on https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/461496/ [00:10:23] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1002/12509/" [puppet] - 10https://gerrit.wikimedia.org/r/461498 (https://phabricator.wikimedia.org/T83992) (owner: 10Ayounsi) [00:16:26] (03CR) 10Dzahn: "yes, standardizing the rewrites is nice and this is a closed wiki since 2011" [puppet] - 10https://gerrit.wikimedia.org/r/452635 
(https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [00:17:46] (03CR) 10Dzahn: "eh, i didn't expect these other sites besides usability to show up in this compiler run: https://puppet-compiler.wmflabs.org/compiler1002/" [puppet] - 10https://gerrit.wikimedia.org/r/452635 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [00:18:57] (03CR) 10Dzahn: "i got the order of changes wrong or there are a bit weird dependencies" [puppet] - 10https://gerrit.wikimedia.org/r/452635 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [00:24:19] (03PS2) 10Dzahn: mediawiki::web::prod_sites: reorganize code a bit [puppet] - 10https://gerrit.wikimedia.org/r/461393 (owner: 10Giuseppe Lavagetto) [00:32:48] (03PS3) 10Dzahn: mediawiki::web::prod_sites: reorganize code a bit [puppet] - 10https://gerrit.wikimedia.org/r/461393 (owner: 10Giuseppe Lavagetto) [00:32:56] (03CR) 10Dzahn: "rebasing so it's not based on the "convert loginwiki" one because:" [puppet] - 10https://gerrit.wikimedia.org/r/461393 (owner: 10Giuseppe Lavagetto) [00:34:43] (03CR) 10Krinkle: [C: 04-1] "This looks substantial:" [puppet] - 10https://gerrit.wikimedia.org/r/451258 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [00:34:53] (03CR) 10Dzahn: [C: 031] "noop now: https://puppet-compiler.wmflabs.org/compiler1002/12512/" [puppet] - 10https://gerrit.wikimedia.org/r/461393 (owner: 10Giuseppe Lavagetto) [00:37:40] (03CR) 10Dzahn: [C: 032] "no change to Apache itself, only moving it within puppet" [puppet] - 10https://gerrit.wikimedia.org/r/461393 (owner: 10Giuseppe Lavagetto) [00:39:44] (03PS1) 10Ayounsi: SNMP: set snmp-mibs-downloader BASEDIR to Debian 9 standard [puppet] - 10https://gerrit.wikimedia.org/r/461503 (https://phabricator.wikimedia.org/T83992) [00:45:51] * Krinkle staging on mwdebug2002 [00:48:12] (03CR) 10Dzahn: [C: 032] "no change confirmed on mwdebug2001/1001.. 
then re-enabled puppet on mw2*" [puppet] - 10https://gerrit.wikimedia.org/r/461393 (owner: 10Giuseppe Lavagetto) [00:49:01] !log temp disabled puppet on mw2*, deployed gerrit 461393, confirmed noop, re-enabled puppet on mw2* [00:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:49:52] (03CR) 10Ayounsi: "Out of the different ways of fixing the issue I think this is the cleanest." [puppet] - 10https://gerrit.wikimedia.org/r/461503 (https://phabricator.wikimedia.org/T83992) (owner: 10Ayounsi) [00:52:29] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.22/extensions/3D/modules/: I9e718957a497, T204621 (duration: 00m 58s) [00:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:52:36] T204621: Reduce footprint of ext.3d on page initialisation - https://phabricator.wikimedia.org/T204621 [00:59:50] RECOVERY - Check systemd state on elastic1025 is OK: OK - running: The system is fully operational [01:05:03] (03PS2) 10Dzahn: mediawiki::web::prod_sites: convert foundation.w.o to use vhost [puppet] - 10https://gerrit.wikimedia.org/r/461394 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [01:14:45] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.22/extensions/GlobalPreferences/includes/GlobalPreferencesFactory.php: I52434a523f60e, T204864 (duration: 00m 58s) [01:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:14:53] T204864: [1.32.0-wmf.22] includes/GlobalPreferencesFactory.php: PHP Notice: Undefined index: section - https://phabricator.wikimedia.org/T204864 [01:19:02] PROBLEM - graphite-labs.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time [01:21:21] RECOVERY - graphite-labs.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.016 second response time [01:25:30] (03CR) 10Dzahn: "test URLs for foundation wiki rewrites:" [puppet] - 
10https://gerrit.wikimedia.org/r/461394 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [01:25:57] (03CR) 10Dzahn: "result before:" [puppet] - 10https://gerrit.wikimedia.org/r/461394 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [01:27:08] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.22/resources/src/startup/: I6c77b25856 (duration: 00m 55s) [01:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:27:45] (03CR) 10Dzahn: "result after:" [puppet] - 10https://gerrit.wikimedia.org/r/461394 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [01:28:02] (03CR) 10Dzahn: [C: 04-1] "http://wikimediafoundation.org" [puppet] - 10https://gerrit.wikimedia.org/r/461394 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [01:30:43] (03CR) 10Krinkle: mediawiki::web::prod_sites: convert foundation.w.o to use vhost (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/461394 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [02:21:21] 10Operations, 10wikitech.wikimedia.org, 10Documentation: Update wasat/mwmaint2001 docs on Wikitech - https://phabricator.wikimedia.org/T204389 (10Dzahn) https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/mwmaint2001.codfw.wmnet https://wikitech.wikimedia.org/w/index.php?title=Mwmaint2001&type=revisio... 
[02:22:20] 10Operations, 10ops-codfw, 10netops: Rename of wasat to mwmaint2001 (switch labels et al) - https://phabricator.wikimedia.org/T199530 (10Dzahn) [02:22:22] 10Operations, 10wikitech.wikimedia.org, 10Documentation: Update wasat/mwmaint2001 docs on Wikitech - https://phabricator.wikimedia.org/T204389 (10Dzahn) [02:23:32] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.20) (duration: 09m 26s) [02:23:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:25:06] 10Operations, 10wikitech.wikimedia.org, 10Documentation: Update wasat/mwmaint2001 docs on Wikitech - https://phabricator.wikimedia.org/T204389 (10Dzahn) Wiki part already done by Krinkle. I pasted the fingerprints. Got the fingerprints by running "gen_fingerprints" on the host. done? more? [02:56:05] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.22) (duration: 13m 54s) [02:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:06:53] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Thu Sep 20 03:06:52 UTC 2018 (duration 10m 47s) [03:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:11:00] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [03:14:45] (03CR) 10Mathew.onipe: Added force shard allocation to elasticsearch_cluster (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/461267 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [03:14:50] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [03:16:19] PROBLEM - 
Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [03:17:30] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [03:18:40] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [03:19:40] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [03:24:59] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [03:25:27] (03PS31) 10Mathew.onipe: Add elasticsearch_cluster module. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) [03:25:29] (03PS2) 10Mathew.onipe: Added force shard allocation to elasticsearch_cluster [software/spicerack] - 10https://gerrit.wikimedia.org/r/461267 (https://phabricator.wikimedia.org/T202885) [03:25:40] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [03:26:36] (03CR) 10jerkins-bot: [V: 04-1] Added force shard allocation to elasticsearch_cluster [software/spicerack] - 10https://gerrit.wikimedia.org/r/461267 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [03:27:15] (03CR) 10jerkins-bot: [V: 04-1] Add elasticsearch_cluster module. [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [03:32:21] (03CR) 10Mathew.onipe: Add elasticsearch_cluster module. (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [03:41:11] (03PS32) 10Mathew.onipe: Add elasticsearch_cluster module. [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) [03:41:13] (03PS3) 10Mathew.onipe: Added force shard allocation to elasticsearch_cluster [software/spicerack] - 10https://gerrit.wikimedia.org/r/461267 (https://phabricator.wikimedia.org/T202885) [03:42:19] (03CR) 10jerkins-bot: [V: 04-1] Added force shard allocation to elasticsearch_cluster [software/spicerack] - 10https://gerrit.wikimedia.org/r/461267 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [03:42:57] (03CR) 10jerkins-bot: [V: 04-1] Add elasticsearch_cluster module. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [03:45:09] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:56:40] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on einsteinium is OK: (C)130 ge (W)110 ge 106.3 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [05:06:57] (03PS1) 10Marostegui: db-codfw.php: Depool db2069 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461511 [05:08:51] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2069 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461511 (owner: 10Marostegui) [05:10:39] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2069 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461511 (owner: 10Marostegui) [05:11:54] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2069 for alter table (duration: 01m 00s) [05:11:56] !log Deploy schema change on db2069 [05:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:15:59] (03CR) 10jenkins-bot: db-codfw.php: Depool db2069 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461511 (owner: 10Marostegui) [05:22:45] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2069" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461512 [05:26:05] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2069" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461512 (owner: 10Marostegui) [05:27:40] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2069" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461512 (owner: 10Marostegui) [05:28:52] !log marostegui@deploy1001 Synchronized 
wmf-config/db-codfw.php: Repool db2069 after alter table (duration: 00m 57s) [05:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:29:08] (03PS1) 10Marostegui: db-codfw.php: db2034 stop reads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461513 [05:30:34] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2069" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461512 (owner: 10Marostegui) [05:30:49] (03CR) 10Marostegui: [C: 032] db-codfw.php: db2034 stop reads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461513 (owner: 10Marostegui) [05:32:27] (03Merged) 10jenkins-bot: db-codfw.php: db2034 stop reads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461513 (owner: 10Marostegui) [05:33:58] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Stop reads on db2034 (duration: 00m 57s) [05:34:01] !log Deploy schema change on db2034 (x1 master) [05:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:37:26] (03PS1) 10Marostegui: Revert "db-codfw.php: db2034 stop reads" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461514 [05:44:38] (03CR) 10jenkins-bot: db-codfw.php: db2034 stop reads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461513 (owner: 10Marostegui) [05:46:29] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: db2034 stop reads" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461514 (owner: 10Marostegui) [05:48:16] (03Merged) 10jenkins-bot: Revert "db-codfw.php: db2034 stop reads" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461514 (owner: 10Marostegui) [05:48:49] PROBLEM - Restbase root url on restbase2001 is CRITICAL: HTTP CRITICAL - No data received from host [05:49:30] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Resume reads on db2034 (duration: 00m 57s) [05:49:35] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:49:50] RECOVERY - Restbase root url on restbase2001 is OK: HTTP OK: HTTP/1.1 200 - 16081 bytes in 0.116 second response time [05:54:08] !log Deploy schema change on s3:mediawikiwiki for echo tables codfw - T51593 [05:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:54:16] T51593: Rename two indexes in the Echo extension - https://phabricator.wikimedia.org/T51593 [05:58:53] (03CR) 10jenkins-bot: Revert "db-codfw.php: db2034 stop reads" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461514 (owner: 10Marostegui) [06:01:02] (03PS1) 10Marostegui: db-codfw.php: Depool db2070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461516 [06:04:57] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461516 (owner: 10Marostegui) [06:06:05] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461516 (owner: 10Marostegui) [06:07:15] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2070 (duration: 00m 58s) [06:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:40] (03CR) 10Giuseppe Lavagetto: "> Patch Set 2: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/461394 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [06:09:37] !log Deploy schema change on db2070 - T203709 [06:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:09:44] T203709: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709 [06:10:45] (03CR) 10Giuseppe Lavagetto: "> Patch Set 9: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/451258 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [06:13:00] (03CR) 10jenkins-bot: db-codfw.php: Depool db2070 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/461516 (owner: 10Marostegui) [06:14:47] marostegui: still deploying? i would need 5-10 mins when you have a break :P [06:14:54] mobrovac: Go ahead! :) [06:15:13] (03PS3) 10Mobrovac: RPC/RunSingleJob.php - send X-Readonly header. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461229 (https://phabricator.wikimedia.org/T204154) (owner: 10Ppchelko) [06:15:22] thnx [06:15:28] * mobrovac is doing [06:16:33] <_joe_> mobrovac is not doing [06:16:41] hahahahaha [06:16:48] (03CR) 10Mobrovac: [C: 032] RPC/RunSingleJob.php - send X-Readonly header. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461229 (https://phabricator.wikimedia.org/T204154) (owner: 10Ppchelko) [06:18:33] (03Merged) 10jenkins-bot: RPC/RunSingleJob.php - send X-Readonly header. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461229 (https://phabricator.wikimedia.org/T204154) (owner: 10Ppchelko) [06:20:29] !log mobrovac@deploy1001 Synchronized rpc/RunSingleJob.php: Have RunSingleJob send the X-Readonly header - T204154 (duration: 00m 58s) [06:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:36] T204154: Kafka JobQueue should respect DB readonly mode - https://phabricator.wikimedia.org/T204154 [06:21:24] k marostegui, done the doing [06:21:29] thnx [06:21:41] mobrovac: Great! 
I will not deploy in a while now [06:21:42] Thanks [06:23:27] <_joe_> mobrovac: adding a header is a bit too respectful of HTTP's spirit; you should've returned 200 with a body with a json encapsulating the error [06:23:42] <_joe_> sorry, it's the jobqueue - the body should be serialized php then [06:23:50] <_joe_> just for consistency [06:23:53] <_joe_> :P [06:24:14] i know, what we are doing is outrageous _joe_ [06:24:20] i don't know if we can live with that [06:25:15] !log mobrovac@deploy1001 Started restart [cpjobqueue/deploy@58f9ed3]: Reset the Kafka connections [06:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:23] <_joe_> mobrovac: I wasn't done; then you create a service that can cache the responses from the jobrunners and transforms said responses into a REST api [06:25:28] <_joe_> :D [06:26:16] no _joe_, i send it to kafka raw and then have a consumer reading the responses and converting them into json and send them to said service for storage with a ttl [06:26:39] <_joe_> ahahaha [06:27:23] (03CR) 10jenkins-bot: RPC/RunSingleJob.php - send X-Readonly header. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461229 (https://phabricator.wikimedia.org/T204154) (owner: 10Ppchelko) [06:29:27] 10Operations, 10DBA, 10Wikidata, 10Wikidata-Query-Service, and 4 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (10Marostegui) These are the last 24h: https://logstash.wikimedia.org/goto/cd0af28f39b7ad679b9d1e1130636fdf Errors are almost... [06:31:21] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/share/bash/puppet-common.sh] [06:37:21] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2070" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461519 [06:40:18] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2070" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461519 (owner: 10Marostegui) [06:41:25] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2070" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461519 (owner: 10Marostegui) [06:41:48] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2070" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461519 (owner: 10Marostegui) [06:42:36] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2070 (duration: 00m 58s) [06:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:39] RECOVERY - IPsec on mc2021 is OK: Strongswan OK - 1 ESP OK [06:45:32] 10Operations, 10ops-codfw, 10netops: Rename of wasat to mwmaint2001 (switch labels et al) - https://phabricator.wikimedia.org/T199530 (10Smalyshev) [06:45:35] 10Operations, 10wikitech.wikimedia.org, 10Documentation: Update wasat/mwmaint2001 docs on Wikitech - https://phabricator.wikimedia.org/T204389 (10Smalyshev) 05Open>03Resolved a:03Smalyshev [06:56:41] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:06] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team (Current), 10User-Joe: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (10jcrespo) That looks really bad performance. Not only that scans the revision table from top to bottom (>200GB of data)... 
[07:10:02] (03PS1) 10Marostegui: db-eqiad.php: Depool pc1004 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461522 [07:11:11] (03CR) 10Jcrespo: [C: 031] "I don't think you really want to "revert" this, what you want to do (I guess) is hardcode the maintenance execution host so you can do mai" [puppet] - 10https://gerrit.wikimedia.org/r/457492 (owner: 10Giuseppe Lavagetto) [07:12:20] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool pc1004 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461522 (owner: 10Marostegui) [07:12:32] (03PS3) 10Jcrespo: mariadb: add mwmaint1002 to grants for production-m5 [puppet] - 10https://gerrit.wikimedia.org/r/461493 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [07:13:14] (03CR) 10Jcrespo: [C: 032] mariadb: add mwmaint1002 to grants for production-m5 [puppet] - 10https://gerrit.wikimedia.org/r/461493 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [07:13:30] (03Merged) 10jenkins-bot: db-eqiad.php: Depool pc1004 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461522 (owner: 10Marostegui) [07:14:46] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool pc1004 (duration: 00m 57s) [07:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:02] !log Stop MySQL on pc1004 for kernel upgrade [07:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:31] RECOVERY - Check size of conntrack table on mc1021 is OK: OK: nf_conntrack is 0 % full [07:16:31] RECOVERY - Check systemd state on mc1021 is OK: OK - running: The system is fully operational [07:18:18] (03CR) 10Jcrespo: [C: 032] "Deployed to m5, please send another patch when 1001 is decommmed." 
[puppet] - 10https://gerrit.wikimedia.org/r/461493 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [07:20:01] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool pc1004" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461524 [07:21:41] RECOVERY - Confd template for /etc/redis/replica/6379-state.conf on mc1021 is OK: No errors detected [07:21:42] RECOVERY - Memcached on mc1021 is OK: TCP OK - 0.000 second response time on 10.64.0.82 port 11211 [07:21:42] RECOVERY - Check whether ferm is active by checking the default input chain on mc1021 is OK: OK ferm input default policy is set [07:21:42] RECOVERY - Disk space on mc1021 is OK: DISK OK [07:21:42] RECOVERY - dhclient process on mc1021 is OK: PROCS OK: 0 processes with command name dhclient [07:21:43] RECOVERY - DPKG on mc1021 is OK: All packages OK [07:21:43] RECOVERY - configured eth on mc1021 is OK: OK - interfaces up [07:21:44] RECOVERY - MD RAID on mc1021 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [07:21:44] RECOVERY - confd service on mc1021 is OK: OK - confd is active [07:21:44] RECOVERY - Check health of redis instance on 6379 on mc1021 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 2245915 keys, up 23 minutes 4 seconds - replication_delay is 0 [07:21:45] RECOVERY - puppet last run on mc1021 is OK: OK: Puppet is currently enabled, last run 16 minutes ago with 0 failures [07:21:51] (03CR) 10Filippo Giunchedi: "hah! 
It'd be nice if we catched these at compilation time (I'm assuming the catalog compiled fine) via a couple of validate_re" [puppet] - 10https://gerrit.wikimedia.org/r/461461 (owner: 10Ayounsi) [07:22:05] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool pc1004" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461524 (owner: 10Marostegui) [07:23:42] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool pc1004" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461524 (owner: 10Marostegui) [07:24:15] (03CR) 10jenkins-bot: db-eqiad.php: Depool pc1004 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461522 (owner: 10Marostegui) [07:24:17] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool pc1004" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461524 (owner: 10Marostegui) [07:25:00] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool pc1004 (duration: 00m 57s) [07:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:35] (03PS1) 10Jcrespo: mariadb: Share API load between db2060 and 67 to avoid overloads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461528 [07:30:12] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team (Current), 10User-Joe: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (10Bawolff) >>! In T202596#4600586, @jcrespo wrote: > That looks really bad performance. Not only that scans the revision... 
[07:31:18] !log repair sdl on ms-be2042 - T199198 [07:31:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:27] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [07:39:01] (03CR) 10Marostegui: [C: 031] mariadb: Share API load between db2060 and 67 to avoid overloads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461528 (owner: 10Jcrespo) [07:40:36] (03PS1) 10Muehlenhoff: Revert "Remove mc1021 from mcrouter" [puppet] - 10https://gerrit.wikimedia.org/r/461530 [07:41:09] (03CR) 1020after4: [C: 031] Drop legacy SSHv1 support [software/keyholder] - 10https://gerrit.wikimedia.org/r/458227 (owner: 10Faidon Liambotis) [07:41:57] (03CR) 1020after4: [C: 031] Stop spawning ssh-keygen but generate fps ourselves [software/keyholder] - 10https://gerrit.wikimedia.org/r/458249 (owner: 10Faidon Liambotis) [07:42:40] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team (Current), 10User-Joe: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (10jcrespo) @Bawolff That new query you propose makes no sense to me- it just selects the first 100 revisions every singl... [07:43:14] (03CR) 1020after4: [C: 031] Abstract the SSH fingerprint generation [software/keyholder] - 10https://gerrit.wikimedia.org/r/458248 (owner: 10Faidon Liambotis) [07:43:42] RECOVERY - Filesystem available is greater than filesystem size on ms-be2042 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2042&var-datasource=codfw%2520prometheus%252Fops [07:45:52] (03CR) 1020after4: [C: 031] "nice!" 
[software/keyholder] - 10https://gerrit.wikimedia.org/r/458239 (owner: 10Faidon Liambotis) [07:46:10] !log upgrading intel-microcode on trusty systems to 3.20180807a [07:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:23] (03CR) 1020after4: [C: 031] "pathlib is much nicer, thanks!" [software/keyholder] - 10https://gerrit.wikimedia.org/r/458246 (owner: 10Faidon Liambotis) [07:47:14] (03CR) 1020after4: [C: 031] Phab: Clarify that spaces are not allowed in user account names [puppet] - 10https://gerrit.wikimedia.org/r/455265 (https://phabricator.wikimedia.org/T179126) (owner: 10Aklapper) [07:48:30] (03PS1) 10Marostegui: db-codfw.php: Depool db2066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461531 (https://phabricator.wikimedia.org/T204006) [07:48:33] (03PS2) 10Mathew.onipe: maps postgresql slow log settings [puppet] - 10https://gerrit.wikimedia.org/r/461206 (https://phabricator.wikimedia.org/T204106) [07:50:09] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461531 (https://phabricator.wikimedia.org/T204006) (owner: 10Marostegui) [07:50:23] (03CR) 10Giuseppe Lavagetto: [C: 031] Revert "Remove mc1021 from mcrouter" [puppet] - 10https://gerrit.wikimedia.org/r/461530 (owner: 10Muehlenhoff) [07:51:38] (03PS4) 10Petar.petkovic: Remove unused default source language config for CX [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460492 [07:51:56] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461531 (https://phabricator.wikimedia.org/T204006) (owner: 10Marostegui) [07:53:08] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2066 (duration: 00m 57s) [07:53:12] !log Deploy schema change on db2066 [07:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[07:54:05] (03CR) 10jenkins-bot: db-codfw.php: Depool db2066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461531 (https://phabricator.wikimedia.org/T204006) (owner: 10Marostegui) [07:54:37] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461532 [07:55:47] (03CR) 10Muehlenhoff: [C: 032] Revert "Remove mc1021 from mcrouter" [puppet] - 10https://gerrit.wikimedia.org/r/461530 (owner: 10Muehlenhoff) [07:58:34] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461532 (owner: 10Marostegui) [08:00:13] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461532 (owner: 10Marostegui) [08:01:22] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2066 (duration: 00m 57s) [08:01:25] I will be merging mine afterwards [08:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:37] (03PS2) 10Jcrespo: mariadb: Share API load between db2060 and 67 to avoid overloads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461528 [08:04:31] (03CR) 10Jcrespo: [C: 032] mariadb: Share API load between db2060 and 67 to avoid overloads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461528 (owner: 10Jcrespo) [08:05:44] (03Merged) 10jenkins-bot: mariadb: Share API load between db2060 and 67 to avoid overloads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461528 (owner: 10Jcrespo) [08:06:00] !log replication stopped and tables being compressed for s2 on dbstore2002 [08:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:14] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461532 (owner: 10Marostegui) [08:08:16] (03CR) 10jenkins-bot: mariadb: Share API load between db2060 and 67 to avoid overloads [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/461528 (owner: 10Jcrespo) [08:10:24] (03PS10) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert loginwiki, chapterwiki [puppet] - 10https://gerrit.wikimedia.org/r/451258 (https://phabricator.wikimedia.org/T196968) [08:10:26] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::vhost: revert to using the apache variable SERVER_NAME [puppet] - 10https://gerrit.wikimedia.org/r/461595 [08:13:47] (03CR) 10Muehlenhoff: mediawiki::web::vhost: revert to using the apache variable SERVER_NAME (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/461595 (owner: 10Giuseppe Lavagetto) [08:19:12] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [08:21:22] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [08:29:20] (03PS1) 10Gilles: Set up Thumbor Haproxy Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/461596 (https://phabricator.wikimedia.org/T187765) [08:30:48] <_joe_> gilles: are we moving to haproxy for thumbor? [08:30:56] _joe_: indeed [08:31:30] <_joe_> it's a bit of a pity we need to invest time in this instead than in moving it to kubernetes [08:32:01] <_joe_> I think it's good you do, we're some months away from doing that probably [08:32:05] <_joe_> (moving to k8s) [08:32:16] <_joe_> but we're duplicating a lot of efforts [08:32:29] we're moving to haproxy because we need very specific queue/concurrency handling. 
I don't know if whatever k8s provides for that will have the same features [08:32:40] it's possible that it will still be fronted by haproxy in k8s [08:33:00] so it's not necessarily a waste of time [08:33:03] <_joe_> well, we have other proxies we use that should serve you but yes, that's a possibility too [08:35:16] what we need is for the proxy to hold a FIFO request queue, while using only 1 connection per thumbor backend. since thumbor is single-threaded this maximizes its throughput without having some long-running requests delay fast ones [08:35:33] that's not possible with nginx [08:36:16] which would send concurrent requests to the same backends, resulting in slow requests delaying fast ones that could have been handled by other idle backends [08:36:26] 10Operations, 10ops-eqiad: mc1021 boot failure - https://phabricator.wikimedia.org/T204812 (10MoritzMuehlenhoff) 05Open>03Resolved Thanks, I finished up the reimage via install_console and re-added it to Icinga, looks all fine now. [08:37:27] gilles: max_conns=number doesn't do it? [08:37:40] nginx has also a queue number [timeout=time] [08:37:41] the problem is the load balancing algo [08:37:52] the one we need is commercial in nginx [08:38:22] out of curiosity which one is it? [08:38:27] with haproxy we can do the ideal thing: unpop from the request queue as soon as a backend is done with a request [08:39:29] I can't remember on the top of my head, let me dig into the nginx docs [08:40:21] I'm fairly sure I tried maxconns on nginx at some point [08:41:18] queue number is commercial according to the docs [08:41:56] least_time would have been interesting, I think that's the one I wanted to try but is commercial as well [08:42:56] the haproxy approach is the most efficient anyway, queued requests going to the next available backend as it frees up. 
the way you rotate between backends doesn't matter if you maximise their use this way [08:43:10] ah right queue is commercial [08:44:02] I don't care which proxy we use, as long as it can do what I've just described. free nginx can't, and I looked at other open source alternatives and most of them didn't even have a queue you could somewhat control [08:44:05] !log rebooting mc1022 for kernel security update [08:44:10] that's how we ended up with haproxy [08:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:47] _joe_: which proxies do you use for kubernetes? I can look at their docs to see if they do what I need [08:44:57] haproxy is a really nice option, and we are already using it in production [08:45:08] <_joe_> gilles: nothing definitive, but we wanted to use envoy [08:46:09] <_joe_> gilles: do you have a phab task I can read about the issues you're solving with HAproxy? I am mostly aware of them but I think my knowledge came from chatting with you in person so it's not set in stone [08:46:22] it may not do a lot of things, but what it does, it does well [08:46:27] _joe_: https://phabricator.wikimedia.org/T187765 [08:50:01] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [08:54:22] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [08:57:40] !log rebooting mc1023 for kernel security update [08:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:16] (03CR) 10Hashar: "recheck" [software/keyholder] - 10https://gerrit.wikimedia.org/r/458236 (owner: 10Faidon Liambotis) [09:05:22] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: 
CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [09:06:50] (03CR) 10jerkins-bot: [V: 04-1] Implement all the SSH agent bits and stop proxying [software/keyholder] - 10https://gerrit.wikimedia.org/r/458236 (owner: 10Faidon Liambotis) [09:07:32] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [09:23:44] 10Operations, 10Puppet, 10Cloud-VPS, 10Release-Engineering-Team, and 3 others: Upgrade Puppet compilers to Stretch - https://phabricator.wikimedia.org/T191438 (10aborrero) [09:23:45] 10Operations, 10Cloud-VPS (Project-requests): Request removal of puppet3-diffs VPS project - https://phabricator.wikimedia.org/T204532 (10aborrero) 05Open>03Resolved a:03aborrero Done :-) [09:24:28] !log rebooting mc1024 for kernel security update [09:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:32] !log Change passwords for wikiuser on dbstore1002 - T200801 [09:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:14] PROBLEM - High load average on labstore1007 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [20.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [09:31:30] 10Operations, 10Wikimedia-Logstash, 10Goal, 10Patch-For-Review: Audit log producers across the infrastructure and plan their transition to centralized logging. - https://phabricator.wikimedia.org/T198756 (10fgiunchedi) I've been researching how the logback/log4j/log4j2 migration could look like, first with... 
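Note on the Thumbor/HAProxy discussion earlier in the log (08:35 to 08:43): the queueing argument can be checked with a small simulation. This is only a sketch under stated assumptions, not HAProxy or nginx configuration: every backend is single-threaded, all requests arrive at time 0, and the function names and numbers are hypothetical. `shared_fifo_queue` models one FIFO queue whose head goes to whichever backend frees up first (one in-flight request per backend); `eager_round_robin` models assigning each request to a backend on arrival, so it waits behind whatever that backend is already doing.

```python
import heapq


def shared_fifo_queue(service_times, backends):
    """One FIFO queue; the next request goes to the first backend to become idle."""
    free_at = [0] * backends  # min-heap of times at which each backend is idle
    heapq.heapify(free_at)
    finish = []
    for cost in service_times:
        start = heapq.heappop(free_at)  # earliest idle backend takes the request
        end = start + cost
        finish.append(end)
        heapq.heappush(free_at, end)
    return finish


def eager_round_robin(service_times, backends):
    """Requests are pinned to backends in rotation and serialize behind each other."""
    free_at = [0] * backends
    finish = []
    for i, cost in enumerate(service_times):
        b = i % backends
        free_at[b] += cost  # single-threaded backend: work piles up in order
        finish.append(free_at[b])
    return finish


if __name__ == "__main__":
    # Hypothetical workload: one slow request followed by three fast ones.
    times = [10, 1, 1, 1]
    print("shared FIFO queue:", shared_fifo_queue(times, backends=2))
    print("eager round robin:", eager_round_robin(times, backends=2))
```

With this workload the third request finishes at time 2 under the shared queue but at time 11 under eager rotation, because it is stuck behind the slow request while the other backend sits idle. That is the "slow requests delaying fast ones" effect described in the channel.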
[09:31:42] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [09:32:18] 10Operations, 10Traffic, 10Patch-For-Review: Renew unified certificates 2017 - https://phabricator.wikimedia.org/T178173 (10Krenair) @bblack looks like this one should be closed? [09:34:43] RECOVERY - High load average on labstore1007 is OK: OK: Less than 50.00% above the threshold [12.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [09:35:37] I guess the memcache errors are expected due to reboots [09:36:12] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [09:37:44] !log rebooting mc1025 for kernel security update [09:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:33] 10Operations, 10ops-codfw, 10DBA: Several issues with mgmt interfaces on es200X hosts - https://phabricator.wikimedia.org/T204928 (10jcrespo) [09:40:46] 10Operations, 10ops-codfw, 10DBA: Several issues with mgmt interfaces on es200X hosts - https://phabricator.wikimedia.org/T204928 (10jcrespo) p:05Triage>03Lowest [09:41:01] (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline, +1 on the idea but please implement under prometheus:: and profile:: like other exporters" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/461596 (https://phabricator.wikimedia.org/T187765) (owner: 10Gilles) [09:41:31] PROBLEM - Apache HTTP on mw2193 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:42:22] RECOVERY - Apache HTTP on mw2193 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.094 second response time [09:42:51] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the 
critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [09:48:53] (03CR) 10Filippo Giunchedi: Backend-Timing Varnish mtail program (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/434879 (https://phabricator.wikimedia.org/T131894) (owner: 10Gilles) [09:49:31] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [09:49:45] 10Operations, 10ops-codfw, 10DBA: Several issues with mgmt interfaces on es200X hosts - https://phabricator.wikimedia.org/T204928 (10jcrespo) [09:50:34] !log rebooting mc1026 for kernel security update [09:50:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:17] (03PS3) 10Gilles: Backend-Timing Varnish mtail program [puppet] - 10https://gerrit.wikimedia.org/r/434879 (https://phabricator.wikimedia.org/T131894) [09:53:27] (03PS4) 10Gilles: Backend-Timing Varnish mtail program [puppet] - 10https://gerrit.wikimedia.org/r/434879 (https://phabricator.wikimedia.org/T131894) [09:53:38] (03CR) 10Gilles: Backend-Timing Varnish mtail program (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/434879 (https://phabricator.wikimedia.org/T131894) (owner: 10Gilles) [09:56:11] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [10:02:42] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [10:04:05] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/434879 (https://phabricator.wikimedia.org/T131894) (owner: 10Gilles) 
[10:05:20] !log rebooting mc1027 for kernel security update [10:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:51] PROBLEM - Host es2003 is DOWN: PING CRITICAL - Packet loss = 100% [10:08:01] RECOVERY - Host es2003 is UP: PING OK - Packet loss = 0%, RTA = 36.14 ms [10:10:28] jynus, marostegui: ^^ FYI [10:10:37] (03PS2) 10Gilles: Set up Thumbor Haproxy Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/461596 (https://phabricator.wikimedia.org/T187765) [10:11:13] volans: checking (we are in a meeting) [10:11:26] marostegui: uptime 3 min [10:11:31] got rebooted [10:11:51] I think jaime was upgrading them [10:11:54] I will check if that was one of them [10:12:08] <_joe_> "Puppet is disabled. reboot" [10:12:19] <_joe_> yes [10:12:20] ah, ok, sorry, didn't see it in SAL [10:14:21] 10Operations, 10Traffic, 10fundraising-tech-ops, 10HTTPS: Re-evaluate use of EV certificates for payments.wm.o? - https://phabricator.wikimedia.org/T204931 (10Krenair) [10:14:47] (03CR) 10Faidon Liambotis: "I think the icinga servers are being upgraded to stretch as we speak, so maybe we should just ignore the legacy stuff entirely? :)" [puppet] - 10https://gerrit.wikimedia.org/r/461503 (https://phabricator.wikimedia.org/T83992) (owner: 10Ayounsi) [10:14:51] 10Operations, 10Traffic, 10fundraising-tech-ops, 10HTTPS: Re-evaluate use of EV certificates for payments.wm.o? 
- https://phabricator.wikimedia.org/T204931 (10Krenair) [10:16:05] !log rebooting mc1028 for kernel security update [10:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:10] (03CR) 10Gilles: Set up Thumbor Haproxy Prometheus exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/461596 (https://phabricator.wikimedia.org/T187765) (owner: 10Gilles) [10:22:22] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [10:22:47] 10Operations, 10Traffic, 10fundraising-tech-ops, 10HTTPS: Re-evaluate use of EV certificates for payments.wm.o? - https://phabricator.wikimedia.org/T204931 (10Krenair) Just to emphasise, it's not doing anything special on Chrome on my Android phone, and the article linked above shows similar things on some... [10:26:42] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [10:30:22] !log rebooting mc1029 for kernel security update [10:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:50] 10Operations, 10Wikimedia-Logstash, 10Goal, 10Patch-For-Review: Audit log producers across the infrastructure and plan their transition to centralized logging. - https://phabricator.wikimedia.org/T198756 (10fgiunchedi) As far as Python programs using `python-logstash` we should be able to use `logging.SysL... [10:35:04] (03PS1) 10Hashar: Do not merge: capture /tmp in Jenkins [software/keyholder] - 10https://gerrit.wikimedia.org/r/461619 [10:35:45] (03PS1) 10Hashar: Do not merge: pycrypto install with /tmp captured? 
[software/keyholder] - 10https://gerrit.wikimedia.org/r/461620 [10:36:32] 10Operations, 10Traffic, 10fundraising-tech-ops, 10HTTPS: Re-evaluate use of EV certificates for payments.wm.o? - https://phabricator.wikimedia.org/T204931 (10Krenair) [10:37:09] (03CR) 10jerkins-bot: [V: 04-1] Do not merge: pycrypto install with /tmp captured? [software/keyholder] - 10https://gerrit.wikimedia.org/r/461620 (owner: 10Hashar) [10:43:41] !log rebooting mc1030 for kernel security update [10:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:54] (03PS2) 10Hashar: Do not merge: point pip build to /src/log [software/keyholder] - 10https://gerrit.wikimedia.org/r/461619 [10:51:34] (03CR) 10jerkins-bot: [V: 04-1] Do not merge: point pip build to /src/log [software/keyholder] - 10https://gerrit.wikimedia.org/r/461619 (owner: 10Hashar) [10:52:53] (03PS3) 10Hashar: Do not merge: point pip build to /src/log [software/keyholder] - 10https://gerrit.wikimedia.org/r/461619 [10:53:13] (03PS3) 10Gilles: Set up Thumbor Haproxy Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/461596 (https://phabricator.wikimedia.org/T187765) [10:53:29] (03CR) 10jerkins-bot: [V: 04-1] Do not merge: point pip build to /src/log [software/keyholder] - 10https://gerrit.wikimedia.org/r/461619 (owner: 10Hashar) [10:54:08] (03PS4) 10Hashar: Do not merge: point pip build to /src/log [software/keyholder] - 10https://gerrit.wikimedia.org/r/461619 [10:56:09] (03CR) 10Hashar: "recheck" [software/keyholder] - 10https://gerrit.wikimedia.org/r/461620 (owner: 10Hashar) [10:58:15] !log rebooting mc1031 for kernel security update [10:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate European Mid-day SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180920T1100). [11:00:04] dcausse: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:17] o/ [11:00:26] dcausse: you'll deploy your commits? [11:00:40] o/ [11:01:41] zeljkof: it's a php script in docroot and I've never deployed such changes [11:01:59] is it the same process? [11:02:22] dcausse: uh, not sure, let's ask hashar... :D I'm looking at the patch... [11:03:12] dcausse: yes, it should be the same as usual, it's a config change, just deploy the file :D (as far as I can tell) [11:03:21] !log elasticsearch eqiad cluster restart for new systemd unit completed [11:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:48] zeljkof: ok swating then [11:04:18] dcausse: this is as good time as any for "I broke wikipedia" t-shirt :D [11:04:54] :) [11:05:05] I am there yeah [11:05:40] :d [11:05:49] zeljkof: so I have to use deploy1001.eqiad but all other hosts should be in codfw (mwdebug, mwlog) ? [11:06:08] mwdebug in codfw, mwlog in eqiad afaik [11:06:10] zeljkof, what's SWAT's status? Do you have time at least for one other patch? Forgot about SWAT, cannot remember its that early... 
[11:06:21] mwlog in codfw didnt work for me D: [11:06:36] indeed :) [11:07:34] Urbanecm: just one patch for swat, dcausse is deploying it, if you have patches, the window is mostly available [11:07:39] !log rebooting mc1032 for kernel security update [11:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:49] good, updating calendar zeljkof [11:08:25] (03PS2) 10Urbanecm: Introduce engineer user group on Czech Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455853 (https://phabricator.wikimedia.org/T203000) [11:08:34] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446801 (owner: 10DCausse) [11:09:57] (03Merged) 10jenkins-bot: search.wikimedia.org should properly handle multivalue separation char (0x1F) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446801 (owner: 10DCausse) [11:10:53] zeljkof, added another 5 patches [11:12:01] jynus: I see https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/461528 merged but not deployed on deploy1001.eqiad.wmnet [11:12:30] that was me sorry [11:12:42] aparently I didn't log nor downtime [11:12:52] as they are essentially disks with no service running [11:13:00] plus trying to do the same while in a meeting [11:13:05] :) [11:13:13] (03PS2) 10Urbanecm: add Radlines.org to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457474 (https://phabricator.wikimedia.org/T203219) [11:13:20] (03PS2) 10Urbanecm: Fix a typo in zhwikiversity's importsources definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460947 (https://phabricator.wikimedia.org/T201328) [11:13:24] (03PS2) 10Urbanecm: Create eliminator group at Vietnamese Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460701 (https://phabricator.wikimedia.org/T202207) [11:13:25] sorry, answering manuel [11:13:29] but that is the same thing [11:13:37] ^dcausse [11:13:40] ah ok [11:13:44] I will deploy when I can [11:13:52] 
PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:13:53] ok rebasing [11:13:56] you can rebase [11:14:00] (03PS4) 10Urbanecm: Create new namespaces in zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455395 (https://phabricator.wikimedia.org/T201675) [11:14:01] sure thanks [11:14:04] it should only touch db-codfw.php [11:14:15] but completely compatible [11:14:15] yes that's it [11:14:33] I am trying to do too many things at the same time [11:14:38] meetings, reboots and deploys [11:15:03] Urbanecm: ok, will deploy as much as possible [11:15:09] !log rebooting es200X hosts for upgrade [11:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:39] dcausse: you're waiting for 461528 to be deployed before you deploy? [11:15:47] (03CR) 10Alexandros Kosiaris: [C: 04-1] mediawiki: improve siteinfo checks (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/460114 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [11:16:22] PROBLEM - Disk space on maps1001 is CRITICAL: DISK CRITICAL - free space: /srv 54639 MB (3% inode=99%) [11:16:43] thank you zeljkof [11:17:20] (03CR) 10jenkins-bot: search.wikimedia.org should properly handle multivalue separation char (0x1F) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446801 (owner: 10DCausse) [11:17:57] 10Operations, 10ops-codfw, 10DBA: Several issues with mgmt interfaces on es200X hosts - https://phabricator.wikimedia.org/T204928 (10jcrespo) [11:18:12] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:18:13] please ping me when I'm needed [11:19:03] Urbanecm: ok [11:19:41] !log 
dcausse@deploy1001 Synchronized ./docroot/search.wikimedia.org/index.php: search.wikimedia.org should properly handle multivalue separation char (0x1F) (duration: 00m 58s) [11:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:02] zeljkof: I'm done [11:20:17] (03PS3) 10Volans: mediawiki: improve siteinfo checks [software/spicerack] - 10https://gerrit.wikimedia.org/r/460114 (https://phabricator.wikimedia.org/T199079) [11:20:25] (03CR) 10Volans: "done" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/460114 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [11:20:40] 10Operations, 10ops-codfw, 10DBA: Issues with mgmt interface on es2001 host - https://phabricator.wikimedia.org/T204928 (10jcrespo) [11:20:43] dcausse: so, what's the story with a commit at deploy1001? [11:21:02] zeljkof: what do you mean? [11:21:27] dcausse: is there an un-deployed commit there? [11:21:38] zeljkof: there's wmf-config/db-eqiad.php that is updated but no deployed yet [11:21:56] dcausse: so I just ignore it? [11:22:16] yes unless you need to sync the whole dir [11:22:22] (03CR) 10Hashar: "recheck" [software/keyholder] - 10https://gerrit.wikimedia.org/r/458236 (owner: 10Faidon Liambotis) [11:22:27] in which case you should sync-up with DBAs I guess [11:22:41] dcausse: thanks [11:24:32] (03Abandoned) 10Hashar: Do not merge: point pip build to /src/log [software/keyholder] - 10https://gerrit.wikimedia.org/r/461619 (owner: 10Hashar) [11:24:35] (03Abandoned) 10Hashar: Do not merge: pycrypto install with /tmp captured? [software/keyholder] - 10https://gerrit.wikimedia.org/r/461620 (owner: 10Hashar) [11:25:01] Urbanecm: please stand by, the first patch will be ready for testing in a few minutes, this time at mwdebug2001.codfw.wmnet (because of datacenter switch) [11:26:04] zeljkof, silly question, how to connect to mwdebug2001.codfw.wmnet? 
I don't see it in the extension [11:26:29] Urbanecm: ah, sorry "mwdebug2001.codfw.wmnet (alias mw2017.codfw.wmnet)" [11:26:33] https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug#Available_backends [11:26:40] I see mw2017.codfw, okay [11:27:22] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455853 (https://phabricator.wikimedia.org/T203000) (owner: 10Urbanecm) [11:28:21] RECOVERY - Disk space on maps1001 is OK: DISK OK [11:28:39] !log rebooting mc1033 for kernel security update [11:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:51] (03CR) 10Hashar: "pycrypto install failed because /tmp for that job was a tmpfs with the noexec flag. That caused the compilation test to fail to execute a " [software/keyholder] - 10https://gerrit.wikimedia.org/r/458236 (owner: 10Faidon Liambotis) [11:29:19] (03Merged) 10jenkins-bot: Introduce engineer user group on Czech Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455853 (https://phabricator.wikimedia.org/T203000) (owner: 10Urbanecm) [11:32:03] (03CR) 10jenkins-bot: Introduce engineer user group on Czech Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455853 (https://phabricator.wikimedia.org/T203000) (owner: 10Urbanecm) [11:32:34] Urbanecm: 455853 is at mwdebug2001 [11:34:06] zeljkof, working, please deploy it to whole universe [11:34:12] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet operation_type=create_container https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:34:29] Urbanecm ok [11:35:07] (03PS5) 10Zfilipin: Create new namespaces in zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455395 (https://phabricator.wikimedia.org/T201675) (owner: 10Urbanecm) [11:35:22] RECOVERY - kubelet operational latencies on kubernetes2001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:35:38] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:455853|Introduce engineer user group on Czech Wikipedia (T203000)]] (duration: 00m 58s) [11:35:45] Urbanecm: it's deployed [11:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:46] T203000: Introduce engineer user group on Czech Wikipedia - https://phabricator.wikimedia.org/T203000 [11:35:52] thx [11:36:58] (03CR) 10jerkins-bot: [V: 04-1] Abstract the SSH fingerprint generation [software/keyholder] - 10https://gerrit.wikimedia.org/r/458248 (owner: 10Faidon Liambotis) [11:37:02] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455395 (https://phabricator.wikimedia.org/T201675) (owner: 10Urbanecm) [11:38:01] !log rebooting mc1034 for kernel security update [11:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:39] (03Merged) 10jenkins-bot: Create new namespaces in zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455395 (https://phabricator.wikimedia.org/T201675) (owner: 10Urbanecm) [11:41:41] hashar, robh: I'm confused. I need to run a script during swat, but when I `ssh mwmaint1001.eqiad.wmnetssh mwmaint1001.eqiad.wmnet` I get "DO NOT USE THIS SERVER" [11:41:55] (03PS1) 10Jcrespo: mariadb: Reduce db2047 load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461626 [11:42:07] https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Maintenance_scripts says "mwmaint1001.eqiad.wmnet or wasat.codfw.wmnet" [11:42:11] zeljkof: because we have switched to codfw [11:42:22] so I guess you want to try mwmaint2001.codfw.wmnet ? 
[11:42:31] zeljkof, I'd sync it to the whole cluster before running the script [11:42:53] also the deployment canaries are wrong [11:43:05] scap checks the servers from eqiad but prod has been switched to codfw [11:43:11] so the canaries check are useless ::\ [11:43:25] hashar: ok, mwmaint2001.codfw.wmnet works, so I guess the docs need to be updated from wasat.codfw.wmnet? [11:44:22] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:44:39] Urbanecm: 455395 is at mwdebug2001 [11:45:44] zeljkof, namespace got created, please sync and run the script [11:46:32] Urbanecm: ok [11:46:54] !log rebooting mc1035 for kernel security update [11:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:42] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:455395|Create new namespaces in zhwikiversity (T201675)]] (duration: 00m 57s) [11:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:50] T201675: Create new namespaces in zhwikiversity - https://phabricator.wikimedia.org/T201675 [11:50:12] (03CR) 10Filippo Giunchedi: Set up Thumbor Haproxy Prometheus exporter (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/461596 (https://phabricator.wikimedia.org/T187765) (owner: 10Gilles) [11:52:26] Urbanecm, hashar: uh oh https://phabricator.wikimedia.org/T201675#4601326 [11:52:34] looking [11:52:39] `Error: 1062 Duplicate entry '104-Shell' for key 'name_title' (10.192.32.103)` [11:53:12] `[53de5b34dcacc7718b19ac00] [no req] Wikimedia\Rdbms\DBQueryError from line 1458 of /srv/mediawiki/php-1.32.0-wmf.22/includes/libs/rdbms/database/Database.php: A database query error has occurred. Did you forget to run your application's database schema updater after upgrading? 
` [11:53:27] robh: do you know what went wrong? ^ [11:53:41] (03PS1) 10Alexandros Kosiaris: Add nodejs10 docker production image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/461627 [11:53:59] Urbanecm: 455395 is deployed, by the way [11:54:07] hmm [11:54:18] (03CR) 10jenkins-bot: Create new namespaces in zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455395 (https://phabricator.wikimedia.org/T201675) (owner: 10Urbanecm) [11:54:24] I don't think we should have this deployed without the script being run successfully. [11:54:35] I think we should either manage to run the script or revert [11:54:54] I guess there is a "Shell" page in both namespaces? [11:54:59] Urbanecm: but in this case the script ran for a while, then stopped [11:55:17] Urbanecm: so just a revert, who knows what state things are in [11:55:18] Hmm, didn't notice it is not only dry run [11:55:21] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:55:31] zeljkof, what about --add-prefix=T201675? [11:55:31] T201675: Create new namespaces in zhwikiversity - https://phabricator.wikimedia.org/T201675 [11:55:40] Urbanecm: yes, dry run was fine, but then the real run exploded [11:55:54] (if its only because two Shell pages in both namespaces, the --add-prefix thing should fix it) [11:56:42] hashar: should I do --add-prefix=T201675? [11:57:03] maybe :) [11:57:55] zeljkof: I don't know really [11:58:03] let the script fix the other entries [11:58:09] and see what it reports after that? [11:58:33] hashar, zeljkof already run real run, didn't he? [11:58:35] hashar: so, just re-run the script? how do I let it fix things? 
[11:58:50] namespaceDupes.php --fix [11:59:00] (03PS2) 10Alexandros Kosiaris: Add nodejs10 docker production image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/461627 (https://phabricator.wikimedia.org/T201611) [11:59:03] hashar: ah, yes, I did the dry run, looked good, then I did --fix and it exploded [11:59:25] hashar: are you saying I should just re-run the script? [11:59:34] (03CR) 10Muehlenhoff: [C: 031] Add nodejs10 docker production image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/461627 (https://phabricator.wikimedia.org/T201611) (owner: 10Alexandros Kosiaris) [11:59:48] without --add-prefix=T201675? [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180920T1200) [12:00:36] Urbanecm: ok, we are officially out of time, please move remaining commits to another window [12:00:39] suggestion: rename 0-Subject:shell to something else [12:00:51] using mediawiki, then retry [12:00:53] jynus, the namespaceDupes.php script has parameter for it [12:00:57] --add-prefix :) [12:01:26] See https://www.mediawiki.org/wiki/Manual:NamespaceDupes.php [12:01:32] well, I am suggesting that as an alternative, as you don't seem sure to use it [12:01:44] !log rebooting mc1036 for kernel security update [12:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:10] (not me, I'm not a deployer, zeljkof, I'm just saying what I know :-) ) [12:03:13] Urbanecm,jynus: thanks, I think --add-prefix is the way to go [12:05:04] Urbanecm, hashar: ok, --add-prefix did the trick, pasting output to phab [12:05:11] magic [12:05:32] good, thank you zeljkof for your deploys [12:05:48] 104-T201675Shell [12:06:31] it is a redirect [12:06:46] jynus, yeah, up to community to resolve the conflict (probably by deleting) [12:06:52] https://zh.wikiversity.org/w/index.php?title=Subject:T201675Shell&redirect=no [12:07:46] thanks everybody 
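Note on the namespaceDupes.php exchange above: the failure and the `--add-prefix` fix can be illustrated with a toy model. This is a minimal sketch, not MediaWiki's implementation. It assumes, as guessed in the channel, that a "Shell" title already existed in the new namespace 104, and it models the `name_title` unique key as a Python set of (namespace id, title) pairs. The function name, its parameters, and the extra "Intro" page are hypothetical.

```python
def move_into_namespace(pages, ns_name, target_ns, add_prefix=""):
    """Move main-namespace pages titled 'ns_name:Title' into target_ns.

    pages is a set of (namespace_id, title) pairs, standing in for a table
    with a unique key on that pair. If the destination title is taken, the
    move fails unless add_prefix is given, in which case the moved page is
    renamed to add_prefix + title.
    """
    result = set(pages)
    marker = ns_name + ":"
    for ns, title in sorted(pages):
        if ns != 0 or not title.startswith(marker):
            continue  # not a page that belongs in the new namespace
        new_title = title[len(marker):]
        if (target_ns, new_title) in result:
            if not add_prefix:
                raise ValueError(
                    f"Duplicate entry '{target_ns}-{new_title}' for key 'name_title'"
                )
            new_title = add_prefix + new_title  # sidestep the collision
        result.remove((ns, title))
        result.add((target_ns, new_title))
    return result


if __name__ == "__main__":
    pages = {(0, "Subject:Shell"), (0, "Subject:Intro"), (104, "Shell")}
    try:
        move_into_namespace(pages, "Subject", 104)
    except ValueError as err:
        print("without prefix:", err)
    print("with prefix:", sorted(move_into_namespace(pages, "Subject", 104, "T201675")))
```

In the model, the colliding page ends up as (104, "T201675Shell"), which matches the `104-T201675Shell` title reported at 12:05:48; resolving the leftover duplicate is then left to the wiki community, as noted in the channel.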
[12:07:51] !log EU SWAT finished [12:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:53] !log starting elasticsearch codfw cluster restart for new systemd unit [12:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:07] onimisionipe: ^^ [12:15:25] (03PS4) 10Gilles: Set up Thumbor Haproxy Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/461596 (https://phabricator.wikimedia.org/T187765) [12:15:29] (03CR) 10Gilles: Set up Thumbor Haproxy Prometheus exporter (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/461596 (https://phabricator.wikimedia.org/T187765) (owner: 10Gilles) [12:16:14] (03CR) 10jerkins-bot: [V: 04-1] Set up Thumbor Haproxy Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/461596 (https://phabricator.wikimedia.org/T187765) (owner: 10Gilles) [12:17:26] (03CR) 10Gehel: [C: 031] "+1 for samples!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/460114 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [12:19:20] gehel: Ok.. cool [12:21:12] PROBLEM - Memory correctable errors -EDAC- on mw2181 is CRITICAL: 4 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw2181&var-datasource=codfw%2520prometheus%252Fops [12:23:12] (03PS5) 10Gilles: Set up Thumbor Haproxy Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/461596 (https://phabricator.wikimedia.org/T187765) [12:26:33] (03CR) 10Gehel: [C: 04-1] "Looks like logging of slow queries is actually already enabled. 
We probably don't need to be more aggressive, but should already have a lo" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/461206 (https://phabricator.wikimedia.org/T204106) (owner: 10Mathew.onipe) [12:27:52] (03CR) 10Hashar: Abstract the SSH fingerprint generation (031 comment) [software/keyholder] - 10https://gerrit.wikimedia.org/r/458248 (owner: 10Faidon Liambotis) [12:28:50] (03PS2) 10Bstorm: standalone-puppetmaster: stretch compatibility improvement [puppet] - 10https://gerrit.wikimedia.org/r/461419 [12:29:43] (03CR) 10Bstorm: [C: 032] standalone-puppetmaster: stretch compatibility improvement [puppet] - 10https://gerrit.wikimedia.org/r/461419 (owner: 10Bstorm) [12:30:43] (03Abandoned) 10Mathew.onipe: maps postgresql slow log settings [puppet] - 10https://gerrit.wikimedia.org/r/461206 (https://phabricator.wikimedia.org/T204106) (owner: 10Mathew.onipe) [12:33:55] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/12518/" [puppet] - 10https://gerrit.wikimedia.org/r/461596 (https://phabricator.wikimedia.org/T187765) (owner: 10Gilles) [12:33:58] (03CR) 10Filippo Giunchedi: [C: 032] Set up Thumbor Haproxy Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/461596 (https://phabricator.wikimedia.org/T187765) (owner: 10Gilles) [12:34:09] (03PS6) 10Filippo Giunchedi: Set up Thumbor Haproxy Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/461596 (https://phabricator.wikimedia.org/T187765) (owner: 10Gilles) [12:41:58] (03PS1) 10Filippo Giunchedi: prometheus: do not quote haproxy-exporter systemd args [puppet] - 10https://gerrit.wikimedia.org/r/461631 (https://phabricator.wikimedia.org/T187765) [12:42:30] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: do not quote haproxy-exporter systemd args [puppet] - 10https://gerrit.wikimedia.org/r/461631 (https://phabricator.wikimedia.org/T187765) (owner: 10Filippo Giunchedi) [12:44:42] PROBLEM - Check systemd state on thumbor2004 is CRITICAL: CRITICAL - 
degraded: The system is operational but one or more units failed. [12:45:41] PROBLEM - Check systemd state on thumbor1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:45:52] PROBLEM - Check systemd state on thumbor2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:46:46] (03CR) 10Volans: [C: 04-1] "Mostly small things." (0311 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [12:47:04] gilles: is that you? ^^^ [12:47:29] volans: possibly the thing godog just merged [12:47:32] prometheus-haproxy-exporter.service failed [12:47:36] yes [12:47:49] yeah that's correct, it'll recover soon [12:48:32] does it do that because it identifies the service as missing since it's never been started before? [12:49:07] no because it genuinely failed, fixed it in https://gerrit.wikimedia.org/r/461631 [12:49:51] gotcha :) [12:51:21] RECOVERY - Check systemd state on thumbor2004 is OK: OK - running: The system is fully operational [12:51:22] RECOVERY - Check systemd state on thumbor2001 is OK: OK - running: The system is fully operational [12:52:12] RECOVERY - Check systemd state on thumbor1003 is OK: OK - running: The system is fully operational [12:52:31] (03PS1) 10GTirloni: shinken - Change Puppet thresholds [puppet] - 10https://gerrit.wikimedia.org/r/461632 (https://phabricator.wikimedia.org/T161898) [12:54:08] (03CR) 10GTirloni: "Enlisting you all since these Shinken definitions seem tricky to understand for me. Thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/461632 (https://phabricator.wikimedia.org/T161898) (owner: 10GTirloni) [13:00:05] hashar: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train - European version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180920T1300). 
[13:08:10] 10Operations, 10Analytics, 10Analytics-Cluster, 10Cloud-Services: notebook1003 failed network mount on boot - https://phabricator.wikimedia.org/T204857 (10elukey) @Bstorm Thanks a lot for the ping! [13:15:32] PROBLEM - Filesystem available is greater than filesystem size on ms-be2040 is CRITICAL: cluster=swift device=/dev/sdn1 fstype=xfs instance=ms-be2040:9100 job=node mountpoint=/srv/swift-storage/sdn1 site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2040&var-datasource=codfw%2520prometheus%252Fops [13:17:22] (03PS2) 10Jcrespo: mariadb: Reduce db2047 load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461626 [13:18:27] 10Operations, 10cloud-services-team: Ferm leftovers on labtestnet2003 - https://phabricator.wikimedia.org/T204667 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gtirloni on sarin.codfw.wmnet for hosts: ``` labtestnet2003.codfw.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/2018092... [13:20:34] !log T204667 reimage labtestnet2003 [13:20:40] 10Operations, 10ops-eqiad: Heating alerts / memory errors on mw1254 - https://phabricator.wikimedia.org/T204491 (10MoritzMuehlenhoff) Ok, I've repooled the server for now. [13:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:48] T204667: Ferm leftovers on labtestnet2003 - https://phabricator.wikimedia.org/T204667 [13:22:38] 10Operations: Add favicon to icinga and tendril - https://phabricator.wikimedia.org/T204110 (10jcrespo) Icinga is now showing me an icon, but etherpad isn't. Not sure how much of this is valid or it is my browser cache misbehaving. 
[13:23:08] (03CR) 10Marostegui: [C: 031] mariadb: Reduce db2047 load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461626 (owner: 10Jcrespo) [13:23:15] jouncebot: next [13:23:15] In 2 hour(s) and 36 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180920T1600) [13:24:32] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure, 10Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q1): exported puppet resources are not queryable: cannot create grafana graphs of EventLogging running in beta cluster - https://phabricator.wikimedia.org/T204088 (10Ottomata) @Krenair oh ya?... [13:25:15] (03CR) 10Jcrespo: [C: 032] mariadb: Reduce db2047 load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461626 (owner: 10Jcrespo) [13:27:22] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Tune codfw database weights (duration: 00m 58s) [13:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:41] (03PS4) 10C. Scott Ananian: Use core default for Parser preprocessor class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460202 [13:28:30] (03CR) 10C. Scott Ananian: "Split off the dependency into its own patch, I5bbf3e4f65d9b6a0d7419f67e3931e77e92b7e6c, to simplify matters." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460202 (owner: 10C. Scott Ananian) [13:31:07] oh the train [13:31:24] will do it in a few [13:35:42] my issue being the canaries not being codfw ones [13:43:15] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure, 10Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q1): exported puppet resources are not queryable: cannot create grafana graphs of EventLogging running in beta cluster - https://phabricator.wikimedia.org/T204088 (10Krenair) I think so - I'm p... 
[13:44:07] (03CR) 10jenkins-bot: mariadb: Reduce db2047 load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461626 (owner: 10Jcrespo) [13:46:39] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: rack/setup/install analytics-master100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T201939 (10elukey) @RobH @Cmjohnson any chance that we can get this done this week? I am asking since we have an important maintenance window scheduled for Tuesd... [13:48:16] PROBLEM - Filesystem available is greater than filesystem size on ms-be2043 is CRITICAL: cluster=swift device=/dev/sdk1 fstype=xfs instance=ms-be2043:9100 job=node mountpoint=/srv/swift-storage/sdk1 site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2043&var-datasource=codfw%2520prometheus%252Fops [13:50:43] 10Operations, 10DBA, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10Banyek) [13:53:35] (03PS2) 10Giuseppe Lavagetto: mediawiki::web::vhost: revert to using the apache variable SERVER_NAME [puppet] - 10https://gerrit.wikimedia.org/r/461595 [13:53:37] (03PS11) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert loginwiki, chapterwiki [puppet] - 10https://gerrit.wikimedia.org/r/451258 (https://phabricator.wikimedia.org/T196968) [13:54:26] 10Operations, 10Puppet, 10User-Banyek: Debian package or files managed my puppet for pt-kill-wmf - https://phabricator.wikimedia.org/T203674 (10Banyek) [13:54:44] (03PS3) 10Paladox: puppetdb: Don't try to install tuning.conf until dir/package exists [puppet] - 10https://gerrit.wikimedia.org/r/435677 (owner: 10Alex Monk) [13:56:21] (03PS1) 10Hashar: scap: use mediawiki canaries from codfw [puppet] - 10https://gerrit.wikimedia.org/r/461637 (https://phabricator.wikimedia.org/T204907) [13:56:53] (03CR) 10jerkins-bot: [V: 04-1] scap: use mediawiki canaries from codfw [puppet] - 10https://gerrit.wikimedia.org/r/461637 
(https://phabricator.wikimedia.org/T204907) (owner: 10Hashar) [13:56:58] <_joe_> hashar: that's not how I was proposing to do it [13:58:20] (03PS2) 10Hashar: scap: use mediawiki canaries from codfw [puppet] - 10https://gerrit.wikimedia.org/r/461637 (https://phabricator.wikimedia.org/T204907) [13:59:19] 10Operations, 10DBA, 10Wikidata, 10Wikidata-Query-Service, and 4 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (10Marostegui) 300 errors in the last 24h, I think we are good? [14:01:56] (03CR) 10Muehlenhoff: [C: 031] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/461595 (owner: 10Giuseppe Lavagetto) [14:02:16] (03CR) 10Hashar: "puppet compile: https://puppet-compiler.wmflabs.org/compiler1002/12519/" [puppet] - 10https://gerrit.wikimedia.org/r/461637 (https://phabricator.wikimedia.org/T204907) (owner: 10Hashar) [14:05:21] !log rebooting rdb1002 for kernel security update [14:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:23] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::web::vhost: revert to using the apache variable SERVER_NAME (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/461595 (owner: 10Giuseppe Lavagetto) [14:09:13] (03CR) 10Andrew Bogott: [C: 031] "Worth a try!" 
[puppet] - 10https://gerrit.wikimedia.org/r/461632 (https://phabricator.wikimedia.org/T161898) (owner: 10GTirloni) [14:10:38] (03PS4) 10Herron: Kill wiki-mail.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/143762 (owner: 10Faidon Liambotis) [14:11:22] (03CR) 10Herron: [C: 032] Kill wiki-mail.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/143762 (owner: 10Faidon Liambotis) [14:11:51] !log removing unused wiki-mail.wikimedia.org cname (gerrit 143762) [14:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:56] !log upload gdnsd-2.99.42-beta to stretch-wikimedia [14:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:43] !log upgrade authdns1001 gdnsd 2.99.9 -> 2.99.42 [14:15:48] !log rebooting rdb1001 for kernel security update [14:15:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:50] 10Operations, 10Analytics, 10hardware-requests, 10User-Elukey: eqiad | (3) Labs Data Lake hardware - https://phabricator.wikimedia.org/T199674 (10Nuria) [14:20:08] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1002/12520/mw1261.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/451258 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [14:22:12] (03CR) 10Alexandros Kosiaris: [C: 032] scap: use mediawiki canaries from codfw [puppet] - 10https://gerrit.wikimedia.org/r/461637 (https://phabricator.wikimedia.org/T204907) (owner: 10Hashar) [14:22:19] (03PS3) 10Alexandros Kosiaris: scap: use mediawiki canaries from codfw [puppet] - 10https://gerrit.wikimedia.org/r/461637 (https://phabricator.wikimedia.org/T204907) (owner: 10Hashar) [14:22:25] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] scap: use mediawiki canaries from codfw [puppet] - 10https://gerrit.wikimedia.org/r/461637 (https://phabricator.wikimedia.org/T204907) 
(owner: 10Hashar) [14:22:36] PROBLEM - Disk space on maps1001 is CRITICAL: DISK CRITICAL - free space: /srv 55195 MB (3% inode=99%) [14:22:42] !log Deploy schema change on s5 eqiad master with replication T204006 [14:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:50] T204006: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 [14:25:13] (03CR) 10Giuseppe Lavagetto: "With the last patchset, diffs are all whitespaces or redundancies removed. I think it's GTG." [puppet] - 10https://gerrit.wikimedia.org/r/451258 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [14:25:16] (03PS1) 10Elukey: statistics::user: force git webproxy only in production [puppet] - 10https://gerrit.wikimedia.org/r/461639 [14:27:02] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/12521/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/461639 (owner: 10Elukey) [14:27:21] (03PS2) 10Elukey: statistics::user: force git webproxy only in production [puppet] - 10https://gerrit.wikimedia.org/r/461639 [14:27:32] !log maps1001:~# tune2fs -m0 /dev/mapper/maps1001--vg-data [14:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:05] RECOVERY - Disk space on maps1001 is OK: DISK OK [14:28:34] !log hashar@deploy1001 Synchronized typos: Dummy sync to verify list of canaries for T204907 (duration: 00m 59s) [14:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:41] T204907: Scap is checking canary servers in dormant instead of active-dc - https://phabricator.wikimedia.org/T204907 [14:29:01] I freed 5% space by adjusting root reserve on maps1001, but still looks to need some cleanup on /srv. both postgres and cassandra look quite large. gehel are you the right person to ping about that? [14:30:54] herron: I am [14:31:06] and thanks for looking! [14:31:11] np! 
[14:31:14] (03PS3) 10Elukey: statistics::user: force git webproxy only in production [puppet] - 10https://gerrit.wikimedia.org/r/461639 [14:31:29] (03CR) 10Ottomata: [C: 031] statistics::user: force git webproxy only in production [puppet] - 10https://gerrit.wikimedia.org/r/461639 (owner: 10Elukey) [14:32:42] (03PS4) 10Elukey: statistics::user: force git webproxy only in production [puppet] - 10https://gerrit.wikimedia.org/r/461639 [14:32:54] I am going to promote group2 [14:32:55] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/12523/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/461639 (owner: 10Elukey) [14:33:16] I am late, I have been distracted by other duties [14:34:39] (03PS1) 10Hashar: all wikis to 1.32.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461640 [14:34:41] (03CR) 10Hashar: [C: 032] all wikis to 1.32.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461640 (owner: 10Hashar) [14:35:58] (03CR) 10ArielGlenn: "This looks good enough to merge, and I've done some (mw-vagrant) testing which also looks good. Truthy dumps are still running but I could" [puppet] - 10https://gerrit.wikimedia.org/r/447922 (https://phabricator.wikimedia.org/T144103) (owner: 10Smalyshev) [14:36:26] (03CR) 10ArielGlenn: [C: 031] "I'll convert to a +2 tomorrow if people think it's mergeable." 
[puppet] - 10https://gerrit.wikimedia.org/r/447922 (https://phabricator.wikimedia.org/T144103) (owner: 10Smalyshev) [14:36:32] (03Merged) 10jenkins-bot: all wikis to 1.32.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461640 (owner: 10Hashar) [14:38:57] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.32.0-wmf.22 [14:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:56] (03CR) 10Alexandros Kosiaris: Add nodejs10 docker production image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/461627 (https://phabricator.wikimedia.org/T201611) (owner: 10Alexandros Kosiaris) [14:40:03] (03PS3) 10Alexandros Kosiaris: Add nodejs10 docker production image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/461627 (https://phabricator.wikimedia.org/T201611) [14:41:49] (03CR) 10jenkins-bot: all wikis to 1.32.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461640 (owner: 10Hashar) [14:43:09] (03CR) 10Muehlenhoff: mediawiki::web::prod_sites: convert loginwiki, chapterwiki (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/451258 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [14:44:09] !log reduce replication factor to 2 on cassandra maps eqiad - T194966 [14:44:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:17] T194966: disk usage increase on maps servers - https://phabricator.wikimedia.org/T194966 [14:47:48] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work): Access to Maps serverss - https://phabricator.wikimedia.org/T204960 (10Mathew.onipe) [14:48:57] !log repair sdn on ms-be2040 - T199198 [14:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:05] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [14:49:30] it seems group2 is working 
[14:53:35] (03PS1) 10Mathew.onipe: Add Matt to maps-admin [puppet] - 10https://gerrit.wikimedia.org/r/461642 (https://phabricator.wikimedia.org/T204960) [14:53:46] ACKNOWLEDGEMENT - Check health of redis instance on 6378 on rdb1001 is CRITICAL: CRITICAL: replication_delay is 1537455203 600 - REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 15 keys, up 33 minutes 11 seconds - replication_delay is 1537455203 Muehlenhoff replication doesn't matter here anymore [14:53:46] ACKNOWLEDGEMENT - Check health of redis instance on 6379 on rdb1001 is CRITICAL: CRITICAL: replication_delay is 1537455203 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4705935 keys, up 33 minutes 11 seconds - replication_delay is 1537455203 Muehlenhoff replication doesn't matter here anymore [14:53:46] ACKNOWLEDGEMENT - Check health of redis instance on 6381 on rdb1001 is CRITICAL: CRITICAL: replication_delay is 1537455208 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 3463 keys, up 33 minutes 16 seconds - replication_delay is 1537455208 Muehlenhoff replication doesn't matter here anymore [14:54:16] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:56:29] that was ores in codfw btw [14:56:50] I have filed https://phabricator.wikimedia.org/T204961 about ores and http request time out [14:57:36] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:01:01] (03PS1) 10BBlack: multatuli: reimage as stretch [puppet] - 10https://gerrit.wikimedia.org/r/461646 [15:01:50] thanks hashar [15:02:00] godog: I am not sure what is going on with ORES [15:02:08] seems it errored out briefly once group2 got promoted [15:02:09] (03CR) 10C. 
Scott Ananian: "> This doesn't need to be SWAT'ed, just the normal deploy cycle is" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460034 (owner: 10Arlolra) [15:02:12] maybe an overload? [15:03:30] 10Operations, 10cloud-services-team: Ferm leftovers on labtestnet2003 - https://phabricator.wikimedia.org/T204667 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['labtestnet2003.codfw.wmnet'] ``` Of which those **FAILED**: ``` ['labtestnet2003.codfw.wmnet'] ``` [15:04:13] (03CR) 10C. Scott Ananian: "Train is running on time AFAICT, this patch scheduled for the SWAT later tonight." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443645 (https://phabricator.wikimedia.org/T175706) (owner: 10C. Scott Ananian) [15:05:00] !log installing bind9 security updates (client-side tools and libraries) [15:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:07] !log 1.32.0-wmf.22 on group2 seems fine so far \o/ [15:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:37] (03PS6) 10C. 
Scott Ananian: Remove $wgUseTidy and $wgTidyConfig from configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443645 (https://phabricator.wikimedia.org/T175706) [15:07:50] (03CR) 10Muehlenhoff: "I currently use this host to test jessie kernels (and other jessie packages best tested in prod), ideally we'd have a second, stretch-base" [puppet] - 10https://gerrit.wikimedia.org/r/461646 (owner: 10BBlack) [15:10:06] (03PS1) 10ArielGlenn: make path to MWScript.php configurable for xml/sql dumps [puppet] - 10https://gerrit.wikimedia.org/r/461650 (https://phabricator.wikimedia.org/T204962) [15:11:13] (03PS1) 10ArielGlenn: make location of MWScript.php configurable for xml/sql dumps [dumps] - 10https://gerrit.wikimedia.org/r/461651 (https://phabricator.wikimedia.org/T204962) [15:11:38] !log Starting mwscript extensions/PageTriage/maintenance/DeleteAfcStates.php --wiki enwiki (T203184) [15:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:46] T203184: Deploy PageTriage AfC to production - https://phabricator.wikimedia.org/T203184 [15:11:53] (03PS1) 10Muehlenhoff: Add library hint for glib2.0 [puppet] - 10https://gerrit.wikimedia.org/r/461652 [15:13:05] (03CR) 10Muehlenhoff: [C: 032] Add library hint for glib2.0 [puppet] - 10https://gerrit.wikimedia.org/r/461652 (owner: 10Muehlenhoff) [15:14:17] !log Finished mwscript extensions/PageTriage/maintenance/DeleteAfcStates.php --wiki enwiki (T203184) [15:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:45] (03PS2) 10BBlack: multatuli: reimage as stretch authdns [puppet] - 10https://gerrit.wikimedia.org/r/461646 [15:14:47] (03PS1) 10BBlack: Fix resolv.conf config for esams dns servers [puppet] - 10https://gerrit.wikimedia.org/r/461655 [15:15:05] !log Starting mwscript extensions/PageTriage/maintenance/populateDraftQueue.php --wiki enwiki (T203184) [15:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[15:15:19] 10Operations, 10cloud-services-team: Ferm leftovers on labtestnet2003 - https://phabricator.wikimedia.org/T204667 (10GTirloni) Reimage completed. The error was caused by me running Puppet manually instead of letting wmf-auto-reimage do it. [15:15:36] RECOVERY - Filesystem available is greater than filesystem size on ms-be2040 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2040&var-datasource=codfw%2520prometheus%252Fops [15:17:43] !log installing glib2.0 security updates for trusty [15:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:07] 10Operations, 10cloud-services-team: Ferm leftovers on labtestnet2003 - https://phabricator.wikimedia.org/T204667 (10GTirloni) @MoritzMuehlenhoff anything else that needs to be done for this task? [15:24:42] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: Access to Maps servers - https://phabricator.wikimedia.org/T204960 (10Aklapper) [15:29:11] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: Access to Maps servers - https://phabricator.wikimedia.org/T204960 (10Gehel) I confirm that allowing @Mathew.onipe to access maps servers as a member of the `maps-admins` team makes sense and is reasonable. 
[15:31:22] (03PS1) 10Muehlenhoff: Bump changelog for new debdeploy package [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/461664 [15:32:05] (03CR) 10Muehlenhoff: [V: 032 C: 032] Bump changelog for new debdeploy package [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/461664 (owner: 10Muehlenhoff) [15:32:33] (03PS1) 10Jcrespo: mariadb: First version of mariadb backup fresness alert [puppet] - 10https://gerrit.wikimedia.org/r/461665 (https://phabricator.wikimedia.org/T203969) [15:33:23] (03CR) 10jerkins-bot: [V: 04-1] mariadb: First version of mariadb backup fresness alert [puppet] - 10https://gerrit.wikimedia.org/r/461665 (https://phabricator.wikimedia.org/T203969) (owner: 10Jcrespo) [15:34:09] (03PS1) 10ArielGlenn: make 'misc cron dumps' use a configured path to MWScript.php [puppet] - 10https://gerrit.wikimedia.org/r/461667 (https://phabricator.wikimedia.org/T204962) [15:34:54] 10Operations, 10cloud-services-team: Ferm leftovers on labtestnet2003 - https://phabricator.wikimedia.org/T204667 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff >>! In T204667#4602113, @GTirloni wrote: > @MoritzMuehlenhoff anything else that needs to be done for this task? No, all good now. 
[15:35:47] (03CR) 10Muehlenhoff: [C: 031] Add nodejs10 docker production image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/461627 (https://phabricator.wikimedia.org/T201611) (owner: 10Alexandros Kosiaris) [15:37:53] (03PS1) 10Elukey: profile::analytics::cluster::gitconfig: apply on in the prod realm [puppet] - 10https://gerrit.wikimedia.org/r/461671 [15:39:50] (03PS2) 10Jcrespo: mariadb: First version of mariadb backup fresness alert [puppet] - 10https://gerrit.wikimedia.org/r/461665 (https://phabricator.wikimedia.org/T203969) [15:39:55] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add nodejs10 docker production image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/461627 (https://phabricator.wikimedia.org/T201611) (owner: 10Alexandros Kosiaris) [15:40:08] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/12524/" [puppet] - 10https://gerrit.wikimedia.org/r/461671 (owner: 10Elukey) [15:41:10] (03PS3) 10Jcrespo: mariadb: First version of mariadb backup fresness alert [puppet] - 10https://gerrit.wikimedia.org/r/461665 (https://phabricator.wikimedia.org/T203969) [15:45:56] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:47:38] 10Operations, 10ops-eqiad, 10Analytics: setup/install new analytics servers (hostname needed) - https://phabricator.wikimedia.org/T204970 (10RobH) p:05Triage>03Normal [15:48:08] 10Operations, 10ops-eqiad, 10Analytics: setup/install new analytics servers (hostname needed) - https://phabricator.wikimedia.org/T204970 (10RobH) @elukey, Please advise on: * This needs to be two servers with identical hardware, correct? * What is the hostname standard for these two new hosts? [15:48:13] (03PS6) 10Anomie: Set MCR migration to write-both/read-new on testwiki. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/454534 (https://phabricator.wikimedia.org/T198309) (owner: 10Daniel Kinzler) [15:49:24] (03CR) 10Anomie: [C: 032] "Deploying config change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454534 (https://phabricator.wikimedia.org/T198309) (owner: 10Daniel Kinzler) [15:50:04] (03PS1) 10Jcrespo: backups: Setup new backup check on tendril db [puppet] - 10https://gerrit.wikimedia.org/r/461679 (https://phabricator.wikimedia.org/T203969) [15:50:49] (03CR) 10jerkins-bot: [V: 04-1] backups: Setup new backup check on tendril db [puppet] - 10https://gerrit.wikimedia.org/r/461679 (https://phabricator.wikimedia.org/T203969) (owner: 10Jcrespo) [15:51:17] (03Merged) 10jenkins-bot: Set MCR migration to write-both/read-new on testwiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454534 (https://phabricator.wikimedia.org/T198309) (owner: 10Daniel Kinzler) [15:52:16] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:53:07] (03CR) 10jenkins-bot: Set MCR migration to write-both/read-new on testwiki. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/454534 (https://phabricator.wikimedia.org/T198309) (owner: 10Daniel Kinzler) [15:53:16] PROBLEM - Disk space on elastic1027 is CRITICAL: DISK CRITICAL - free space: /srv 26937 MB (5% inode=99%) [15:53:54] 10Operations, 10ops-eqiad, 10Analytics: setup/install new analytics servers (hostname needed) wmf7621 - https://phabricator.wikimedia.org/T204970 (10RobH) [15:55:15] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:56:16] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:56:20] !log anomie@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Setting MCR migration stage to write-both/read-new on testwiki (T198309) (duration: 00m 59s) [15:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:29] T198309: Enable MCR migration stage "write both, read new" on testwiki - https://phabricator.wikimedia.org/T198309 [15:57:42] 10Operations, 10ops-eqiad, 10Analytics: setup/install new analytics servers (hostname needed) wmf7621 - https://phabricator.wikimedia.org/T204970 (10RobH) [15:58:00] 10Operations, 10ops-eqiad, 10Analytics: setup/install analytics-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (10RobH) [16:00:04] godog and _joe_: #bothumor I � Unicode. All rise for Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180920T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. 
[16:03:56] RECOVERY - Disk space on elastic1027 is OK: DISK OK [16:05:19] Operations, Wikimedia-Mailing-lists: Request new mail list for Vietnam Wikimedians User Group - https://phabricator.wikimedia.org/T204974 (minhhuy) [16:06:14] Operations, Wikimedia-Mailing-lists: Request new mail list for Vietnam Wikimedians User Group - https://phabricator.wikimedia.org/T204974 (minhhuy) [16:33:29] Operations, ops-eqiad, Analytics, User-Elukey: rack/setup/install analytics-master100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T201939 (RobH) Ok, hostname update. We need to change the port descriptions, rack table entries, and physical labels on these to an-master100X, not analytics-... [16:35:49] Operations, Cleanup, GitHub-Mirrors, OCG-General, and 7 others: Archive mediawiki/extensions/Collection/OfflineContentGenerator and all OCG-related repos - https://phabricator.wikimedia.org/T183891 (MarcoAurelio) [16:37:45] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [16:38:35] Operations, Cleanup, GitHub-Mirrors, OCG-General, and 7 others: Archive mediawiki/extensions/Collection/OfflineContentGenerator and all OCG-related repos - https://phabricator.wikimedia.org/T183891 (MarcoAurelio) [16:38:52] Operations, Wikimedia-Logstash, Goal, Patch-For-Review: Audit log producers across the infrastructure and plan their transition to centralized logging. - https://phabricator.wikimedia.org/T198756 (fgiunchedi) I tried experimenting with emulating a `json_lines`-compatible udp local endpoint with r... 
[16:39:46] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [16:44:01] (PS13) Dzahn: dumps: monitor generation nfs server hosts for nfsd cpu usage [puppet] - https://gerrit.wikimedia.org/r/363548 (https://phabricator.wikimedia.org/T169680) (owner: ArielGlenn) [16:44:36] it's a bug that the bug bot quits when there are too many bugs [16:47:21] (CR) Dzahn: [C: 2] dumps: monitor generation nfs server hosts for nfsd cpu usage [puppet] - https://gerrit.wikimedia.org/r/363548 (https://phabricator.wikimedia.org/T169680) (owner: ArielGlenn) [16:47:41] (CR) Dzahn: [C: 2] "https://puppet-compiler.wmflabs.org/compiler1001/12526/" [puppet] - https://gerrit.wikimedia.org/r/363548 (https://phabricator.wikimedia.org/T169680) (owner: ArielGlenn) [16:52:55] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [16:54:57] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is CRITICAL: 47.74 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:54:59] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [16:56:54] Operations, Datasets-General-or-Unknown, monitoring, Patch-For-Review: NFS on dataset1001 overloaded, high load on the hosts that mount it - https://phabricator.wikimedia.org/T169680 (Dzahn) merged, ran puppet on dumpsdata1001 and 1002. verified NRPE was refreshed and ok, Icinga config also ok,... 
[16:58:34] Operations, CirrusSearch, Discovery, Elasticsearch, Discovery-Search (Current work): Resolve elasticsearch shard size alert by doing an in reindex - https://phabricator.wikimedia.org/T204362 (Mathew.onipe) p:Normal>High [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: Time to snap out of that daydream and deploy Services – Graphoid / Parsoid / Citoid / ORES. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180920T1700). [17:00:15] I have deployment for ores [17:01:03] no parsoid deployment today [17:01:58] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is OK: (C)60 le (W)70 le 70.05 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:03:22] !log ladsgroup@deploy1001 Started deploy [ores/deploy@ee2d28b]: Returning 429 instead of 408 in case of too many requests (T204956) [17:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:30] T204956: ORES should return 429 instead of 408 in case of too many requests - https://phabricator.wikimedia.org/T204956 [17:03:43] (CR) Dzahn: [C: 2] "https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=dumpsdata1002&service=nfsd+cpu+usage" [puppet] - https://gerrit.wikimedia.org/r/363548 (https://phabricator.wikimedia.org/T169680) (owner: ArielGlenn) [17:04:03] Operations, Datasets-General-or-Unknown, monitoring, Patch-For-Review: NFS on dataset1001 overloaded, high load on the hosts that mount it - https://phabricator.wikimedia.org/T169680 (Dzahn) stalled>Resolved a:Dzahn [17:06:49] works fine on canary, moving to prod [17:07:26] (CR) Dzahn: "yea, we are working on it but might take still a bit. currently not sure how long exactly. 
i guess i would have picked the "Only change th" [puppet] - https://gerrit.wikimedia.org/r/461503 (https://phabricator.wikimedia.org/T83992) (owner: Ayounsi) [17:08:26] (PS5) Petar.petkovic: Remove unused default source language config for CX [mediawiki-config] - https://gerrit.wikimedia.org/r/460492 [17:10:48] (CR) Nikerabbit: [C: 1] Remove unused default source language config for CX [mediawiki-config] - https://gerrit.wikimedia.org/r/460492 (owner: Petar.petkovic) [17:13:05] (Abandoned) Dzahn: Revert "profile::mediawiki::maintenance: depend on mediawiki config, not hiera" [puppet] - https://gerrit.wikimedia.org/r/461496 (owner: Dzahn) [17:13:49] Operations, ops-eqiad, Analytics, User-Elukey: rack/setup/install an-master100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T201939 (RobH) a:Cmjohnson>RobH [17:13:53] thcipriani: FYI I'll be uploading the new scap shortly [17:14:04] (CR) Dzahn: [C: 1] "the last comment was wrong. eqiad is currently not the active dc, obviously." [puppet] - https://gerrit.wikimedia.org/r/461491 (https://phabricator.wikimedia.org/T201343) (owner: Dzahn) [17:14:16] (CR) Dzahn: [C: 1] "(my own comment that is)" [puppet] - https://gerrit.wikimedia.org/r/461491 (https://phabricator.wikimedia.org/T201343) (owner: Dzahn) [17:14:22] (PS1) Ladsgroup: icinga: Update link to ORES in comments [puppet] - https://gerrit.wikimedia.org/r/461691 [17:14:33] Operations, ops-eqiad, netops: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (ayounsi) [17:14:57] godog: oh awesome! 
thank you [17:15:03] mutante: hey, this one is pretty straightforward: https://gerrit.wikimedia.org/r/c/operations/puppet/+/461691/1/modules/icinga/manifests/monitor/ores.pp#b1 [17:16:31] yes Amir1, was already looking [17:16:57] (CR) Dzahn: [C: 2] icinga: Update link to ORES in comments [puppet] - https://gerrit.wikimedia.org/r/461691 (owner: Ladsgroup) [17:17:46] Operations, Discovery-Search, Elasticsearch, SRE-Access-Requests: Add Matt to restricted group so he can access script nodes - https://phabricator.wikimedia.org/T204980 (Mathew.onipe) [17:18:10] (PS2) Filippo Giunchedi: Scap: upgrade to 3.8.6-1 [puppet] - https://gerrit.wikimedia.org/r/460610 (https://phabricator.wikimedia.org/T204383) (owner: Thcipriani) [17:18:15] (CR) Filippo Giunchedi: [C: 2] Scap: upgrade to 3.8.6-1 [puppet] - https://gerrit.wikimedia.org/r/460610 (https://phabricator.wikimedia.org/T204383) (owner: Thcipriani) [17:18:18] Thank you! [17:18:46] you're welcome [17:19:48] (CR) Dzahn: [C: 1] "per "ensure that it doesn't change if the package maitnenance decide to change the default for debian 10". 
haven't tested though" [puppet] - https://gerrit.wikimedia.org/r/461503 (https://phabricator.wikimedia.org/T83992) (owner: Ayounsi) [17:22:22] (PS2) BBlack: Fix resolv.conf config for esams dns servers [puppet] - https://gerrit.wikimedia.org/r/461655 [17:22:57] (PS1) RobH: updating an-master100[12] dns entries [dns] - https://gerrit.wikimedia.org/r/461692 (https://phabricator.wikimedia.org/T201939) [17:23:13] (CR) Dzahn: [C: 1] "thank you for the fast deploy :)) yes, another change will follow to remove the old server once it's switched" [puppet] - https://gerrit.wikimedia.org/r/461493 (https://phabricator.wikimedia.org/T201343) (owner: Dzahn) [17:23:17] (PS1) Mathew.onipe: Add Matt to restricted group [puppet] - https://gerrit.wikimedia.org/r/461693 (https://phabricator.wikimedia.org/T204980) [17:23:43] thcipriani: upgrade done on deploy1001 FYI [17:23:50] (CR) RobH: [C: 2] updating an-master100[12] dns entries [dns] - https://gerrit.wikimedia.org/r/461692 (https://phabricator.wikimedia.org/T201939) (owner: RobH) [17:23:58] (PS2) Dzahn: site: add mediawiki_maintenance role to mwmaint1002 [puppet] - https://gerrit.wikimedia.org/r/461491 (https://phabricator.wikimedia.org/T201343) [17:24:01] !log upload scap 3.8.6-1 - T204383 [17:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:09] T204383: Update Debian Package for Scap to 3.8.6-1 - https://phabricator.wikimedia.org/T204383 [17:24:21] Operations, Wikimedia-Mailing-lists: Open Foundation West Africa (OFWA) mailing list - https://phabricator.wikimedia.org/T203966 (Flixtey) @Aklapper Can you please help here, what is the usual process after filing this ticket for a mailing list. Its about 10 days since I made the request. 
[17:24:29] !log repair sdk on ms-be2043 - T199198 [17:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:36] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [17:24:37] !log ladsgroup@deploy1001 Finished deploy [ores/deploy@ee2d28b]: Returning 429 instead of 408 in case of too many requests (T204956) (duration: 21m 15s) [17:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:44] T204956: ORES should return 429 instead of 408 in case of too many requests - https://phabricator.wikimedia.org/T204956 [17:25:33] godog: i imagine you just typing "repair sdk 2043" anymore on your local shell.. and then it does everything.. the log message, ssh to the server, screen and xfs_repair , heh [17:26:21] mutante: haha no I follow the instructions on wikitech [17:26:39] godog: awesome, I need to run a full scap sync to check it, I think I have time for that before SWAT [17:26:42] ..icinga eventhandler auto-running it seems possible... [17:27:33] godog: :) if we can turn it into a script we could let icinga start it on crit.. [17:28:26] (CR) BBlack: [C: 2] Fix resolv.conf config for esams dns servers [puppet] - https://gerrit.wikimedia.org/r/461655 (owner: BBlack) [17:31:50] (CR) BBlack: [C: 2] multatuli: reimage as stretch authdns [puppet] - https://gerrit.wikimedia.org/r/461646 (owner: BBlack) [17:31:58] !log thcipriani@deploy1001 Started scap: Noop (hopefully) test of php7.0 [17:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:42] Operations, Datasets-General-or-Unknown, monitoring, Patch-For-Review: NFS on dataset1001 overloaded, high load on the hosts that mount it - https://phabricator.wikimedia.org/T169680 (Dzahn) @Bstorm ^ i heard you might be interested in copying this for other NFS servers. we just added these check... 
[17:32:51] mutante: heh sort of a dangerous operation still, spicerack would do it though [17:32:55] thcipriani: ack [17:32:57] (CR) BBlack: [C: 2] "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/461646 (owner: BBlack) [17:33:07] I have to go, though page/call me if needed [17:33:25] alright cya godog [17:35:05] !log thcipriani@deploy1001 Finished scap: Noop (hopefully) test of php7.0 (duration: 03m 07s) [17:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:38] (PS1) RobH: setting base install params for an-master100[12] [puppet] - https://gerrit.wikimedia.org/r/461695 (https://phabricator.wikimedia.org/T201939) [17:37:13] (PS2) RobH: setting base install params for an-master100[12] [puppet] - https://gerrit.wikimedia.org/r/461695 (https://phabricator.wikimedia.org/T201939) [17:37:19] (CR) RobH: [C: 2] setting base install params for an-master100[12] [puppet] - https://gerrit.wikimedia.org/r/461695 (https://phabricator.wikimedia.org/T201939) (owner: RobH) [17:37:20] godog: wowza that was fast, ok, lgtm thank you for the update! [17:37:41] (PS3) BBlack: multatuli: reimage as stretch authdns [puppet] - https://gerrit.wikimedia.org/r/461646 [17:38:14] (CR) Smalyshev: "I think it's ready." 
[puppet] - https://gerrit.wikimedia.org/r/447922 (https://phabricator.wikimedia.org/T144103) (owner: Smalyshev) [17:38:57] (PS4) BBlack: multatuli: reimage as stretch authdns [puppet] - https://gerrit.wikimedia.org/r/461646 [17:39:00] (CR) BBlack: [V: 2 C: 2] multatuli: reimage as stretch authdns [puppet] - https://gerrit.wikimedia.org/r/461646 (owner: BBlack) [17:40:07] Operations, ops-eqiad, Analytics: setup/install analytics-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (RobH) [17:40:16] Operations, ops-eqiad, Analytics: setup/install analytics-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (RobH) a:elukey>RobH [17:40:40] Operations, ops-eqiad, Analytics: setup/install analytics-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (RobH) [17:41:20] Operations, ops-eqiad, Analytics, User-Elukey: rack/setup/install an-master100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T201939 (RobH) [17:46:03] Operations, Scap, Patch-For-Review, Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes on HHVM (rather than ~5 on PHP 5) - https://phabricator.wikimedia.org/T191921 (thcipriani) Open>Resolved a:thcipriani I ran a no-op `scap sync` using the patch... [17:46:53] (PS3) Bmansurov: Enable logging for CitationUsage and CitationUsagePageLoad [mediawiki-config] - https://gerrit.wikimedia.org/r/461117 (https://phabricator.wikimedia.org/T191086) [17:47:46] Operations, Wikimedia-Mailing-lists: Open Foundation West Africa (OFWA) mailing list - https://phabricator.wikimedia.org/T203966 (Aklapper) @Flixtey: Hi, this is #Operations so the process described on https://wikitech.wikimedia.org/wiki/SRE_Clinic_Duty applies [17:48:28] RECOVERY - Filesystem available is greater than filesystem size on ms-be2043 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2043&var-datasource=codfw%2520prometheus%252Fops [17:52:38] Operations, Scap, Patch-For-Review, Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes on HHVM (rather than ~5 on PHP 5) - https://phabricator.wikimedia.org/T191921 (Legoktm) Yay! :-) [17:52:47] Operations, Traffic: Consider adding Must-Staple header to enforce revocation checking - https://phabricator.wikimedia.org/T204987 (Krenair) [17:52:53] the PHP 7 train is chugging along :D [17:53:00] Operations, Traffic: Consider adding Must-Staple header to enforce revocation checking - https://phabricator.wikimedia.org/T204987 (Krenair) https://scotthelme.co.uk/designing-a-new-security-header-expect-staple/ [17:53:20] Operations, Traffic, HTTPS: Consider adding Must-Staple header to enforce revocation checking - https://phabricator.wikimedia.org/T204987 (Krenair) [17:53:46] Operations, Traffic: Consider adding Must-Staple header to enforce revocation checking - https://phabricator.wikimedia.org/T204987 (Krenair) [17:53:56] (PS3) Dzahn: site: add mediawiki_maintenance role to mwmaint1002 [puppet] - https://gerrit.wikimedia.org/r/461491 (https://phabricator.wikimedia.org/T201343) [17:54:55] Operations, Traffic: Puppetise OCSP stapling for all one-off HTTPS servers - https://phabricator.wikimedia.org/T204992 (Krenair) [17:56:36] subbu: how is that VM working out for you? Have you moved any load off of promethium yet? 
[17:58:53] Operations, Traffic: Update certspotter - https://phabricator.wikimedia.org/T204993 (Krenair) [17:59:31] !log reinstalling multatuli [17:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:41] (PS1) RobH: correcting an-master100[12] production dns [dns] - https://gerrit.wikimedia.org/r/461698 (https://phabricator.wikimedia.org/T201939) [18:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate Morning SWAT (Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180920T1800). [18:00:04] bmansurov, stephanebisson, arlolra, and AndyRussG: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:09] here [18:00:42] (CR) RobH: [C: 2] correcting an-master100[12] production dns [dns] - https://gerrit.wikimedia.org/r/461698 (https://phabricator.wikimedia.org/T201939) (owner: RobH) [18:00:57] andrewbogott, i've not done a thing yet .. was away at wikilead and then offsite and doing a bunch of followup post-offsite and catch up after ignoring things for 2 weeks .. but, maybe tomorrow @ mayday, i can kick this off. [18:01:20] hello [18:01:36] Operations, Traffic: Integrate certspotter with certcentral to avoid certspotter notifying us on legitimate certs generated by our certcentral boxes - https://phabricator.wikimedia.org/T204994 (Krenair) [18:01:36] subbu: sounds good! [18:02:04] (CR) Dzahn: [C: 2] site: add mediawiki_maintenance role to mwmaint1002 [puppet] - https://gerrit.wikimedia.org/r/461491 (https://phabricator.wikimedia.org/T201343) (owner: Dzahn) [18:02:30] andrewbogott, maybe tomorrow i can also start looking at those 2 labs vms that need to be upgraded to jessie. 
[18:03:09] Operations, Traffic: Update certspotter - https://phabricator.wikimedia.org/T204993 (Krenair) [18:03:12] Operations, Traffic: Integrate certspotter with certcentral to avoid certspotter notifying us on legitimate certs generated by our certcentral boxes - https://phabricator.wikimedia.org/T204994 (Krenair) [18:03:15] Operations, Traffic, Goal, Patch-For-Review: Deploy a scalable service for ACME (LetsEncrypt) certificate management - https://phabricator.wikimedia.org/T199711 (Krenair) [18:03:21] * subbu stops chattering here so swatters can do their thing [18:05:57] Operations, Traffic: Integrate certspotter with certcentral to avoid certspotter notifying us on legitimate certs generated by our certcentral boxes - https://phabricator.wikimedia.org/T204994 (Krenair) We'd still get stuff being issued from *.corp.wikimedia.org and frack but these are all manual AFAIK (... [18:06:45] Anybody wants to swat today? [18:07:19] (CR) 20after4: [C: 1] Don't drop the colon between hash type/digest [software/keyholder] - https://gerrit.wikimedia.org/r/458229 (owner: Faidon Liambotis) [18:07:46] (CR) 20after4: [C: 1] Only show tracebacks on DEBUG logging levels [software/keyholder] - https://gerrit.wikimedia.org/r/458230 (owner: Faidon Liambotis) [18:08:12] I can SWAT [18:08:16] (PS1) RobH: fixing typo for an-master1001 [dns] - https://gerrit.wikimedia.org/r/461700 [18:08:19] :) [18:09:06] Operations, Traffic: Consider adding expect-CT: header to enforce certificate transparency - https://phabricator.wikimedia.org/T193521 (Krenair) This was discussed in #wikimedia-traffic today. Even though theoretically the header would be useless past 2021-06-01 (when the last publicly trusted certs issu... 
[18:09:21] (CR) RobH: [C: 2] fixing typo for an-master1001 [dns] - https://gerrit.wikimedia.org/r/461700 (owner: RobH) [18:10:37] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/461117 (https://phabricator.wikimedia.org/T191086) (owner: Bmansurov) [18:11:42] (Merged) jenkins-bot: Enable logging for CitationUsage and CitationUsagePageLoad [mediawiki-config] - https://gerrit.wikimedia.org/r/461117 (https://phabricator.wikimedia.org/T191086) (owner: Bmansurov) [18:11:57] Operations, Traffic: certcentral: delay deployment of renewed certs to wait out skewed client clocks - https://phabricator.wikimedia.org/T204997 (Krenair) [18:12:13] Operations, Traffic: certcentral: delay deployment of renewed certs to wait out skewed client clocks - https://phabricator.wikimedia.org/T204997 (Krenair) [18:12:15] Operations, Traffic, Goal, Patch-For-Review: Deploy a scalable service for ACME (LetsEncrypt) certificate management - https://phabricator.wikimedia.org/T199711 (Krenair) [18:13:33] bmansurov: your change is on mwdebug2002, check please (if possible) [18:13:39] thcipriani: My 2 PageTriage patches can easily be tested together if you want to proceed like that. [18:13:40] thcipriani: checking [18:14:06] stephanebisson: cool, thanks for letting me know, I'll do that when they merge [18:14:39] (PS3) Thcipriani: Set $wgSiteMatrixNonGlobalSites global for SiteMatrix [mediawiki-config] - https://gerrit.wikimedia.org/r/460034 (owner: Arlolra) [18:15:17] thcipriani: did you mean mwdebug1002? [18:15:19] thcipriani: thx! [18:15:47] I don't see the mwdebug2002 option [18:16:25] hrm, yeah now that you mention it I don't either [18:17:39] bmansurov: In the browser plug-in it's mislabelled. [18:17:52] James_F: what should it be? 
[18:17:53] (CR) jenkins-bot: Enable logging for CitationUsage and CitationUsagePageLoad [mediawiki-config] - https://gerrit.wikimedia.org/r/461117 (https://phabricator.wikimedia.org/T191086) (owner: Bmansurov) [18:18:04] mwdebug2001 is mw2017 and mwdebug2002 is mw2099. [18:18:11] James_F: thanks! [18:18:18] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:18:19] James_F: ah ha! thank you! [18:18:21] K.renair pushed a fix but it's not been released. [18:18:28] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on einsteinium is CRITICAL: 1.172e+05 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [18:18:50] bmansurov: so mw2099 is where it's pulled (according to the plugin :)) [18:19:15] thcipriani: ok, please go ahead with deploying the patch to everywhere, it's working. [18:19:24] great! will do [18:21:52] !log thcipriani@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:461117|Enable logging for CitationUsage and CitationUsagePageLoad]] T191086 (duration: 00m 51s) [18:21:57] bmansurov: ^ live now [18:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:01] T191086: Instrument and collect data via CitationUsage schema - https://phabricator.wikimedia.org/T191086 [18:22:06] thcipriani: thank you! 
[18:23:45] (CR) Dzahn: [C: 1] Add Matt to maps-admin [puppet] - https://gerrit.wikimedia.org/r/461642 (https://phabricator.wikimedia.org/T204960) (owner: Mathew.onipe) [18:24:48] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:26:44] stephanebisson: your changes are live on mw2099/mwdebug2002, check please [18:26:59] thcipriani: testing now... [18:28:00] Operations, Discovery-Search, Elasticsearch, SRE-Access-Requests, Patch-For-Review: Add Matt to restricted group so he can access script nodes - https://phabricator.wikimedia.org/T204980 (RobH) [18:28:34] thcipriani: works as expected [18:28:48] ok, going live [18:29:05] Operations, SRE-Access-Requests, Discovery-Search (Current work), Patch-For-Review: Access to Maps servers - https://phabricator.wikimedia.org/T204960 (RobH) p:Triage>Normal a:Dzahn>None [18:31:46] !log thcipriani@deploy1001 Synchronized php-1.32.0-wmf.22/extensions/PageTriage/modules: SWAT: [[gerrit:461424|Correctly sync the form when afc_state === "all"]] T204629 [[gerrit:461648|Use same api params for list and stats on page load]] T204629 (duration: 00m 54s) [18:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:55] T204629: AfC: Issues with reloading the page - https://phabricator.wikimedia.org/T204629 [18:31:56] ^ stephanebisson live now [18:32:08] thcipriani: thank you! 
[18:32:17] yw :) [18:32:27] arlolra: ping for SWAT [18:32:33] here [18:32:52] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/460034 (owner: Arlolra) [18:33:01] great :) [18:34:14] (Merged) jenkins-bot: Set $wgSiteMatrixNonGlobalSites global for SiteMatrix [mediawiki-config] - https://gerrit.wikimedia.org/r/460034 (owner: Arlolra) [18:35:16] Operations, Discovery-Search, Elasticsearch, SRE-Access-Requests, Patch-For-Review: Add Matt to restricted group so he can access script nodes - https://phabricator.wikimedia.org/T204980 (RobH) p:Triage>Normal [18:35:48] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:35:56] Operations, Discovery-Search, Elasticsearch, SRE-Access-Requests, Patch-For-Review: add onimisionipe to restricted group - https://phabricator.wikimedia.org/T204980 (RobH) [18:36:14] Operations, SRE-Access-Requests, Discovery-Search (Current work), Patch-For-Review: add onimisionipe to maps-admin - https://phabricator.wikimedia.org/T204960 (RobH) [18:36:20] arlolra: your change is live on mw2099, check please [18:36:24] (if possible) [18:37:02] umm, how do I access that machine? 
[18:37:11] Operations, SRE-Access-Requests, Discovery-Search (Current work), Patch-For-Review: add onimisionipe to maps-admin - https://phabricator.wikimedia.org/T204960 (RobH) [18:37:52] arlolra: https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug#Staging_changes [18:37:53] arlolra: there's a special browser plug-i [18:37:56] in [18:37:58] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:38:43] Operations, Analytics, User-Elukey: rack/setup/install an-master100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T201939 (RobH) a:RobH>None [18:38:49] well that page isn't helpful [18:38:55] * thcipriani digs [18:38:58] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on einsteinium is CRITICAL: 54.62 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [18:39:03] arlolra: https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug#Browser_extensions [18:39:17] install the extension in chrome or firefox [18:39:23] * arlolra regrets not coming prepared [18:39:40] AndyRussG: ah, thanks, I linked to the wrong section :) [18:40:00] thcipriani: heheh it took me a few minutes to find the link.... 
[18:40:10] Operations, ops-eqiad: update label on an-master100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T204999 (RobH) p:Triage>Low [18:40:39] ok, I just did [18:40:49] curl -H "X-Wikimedia-Debug: mwdebug2002.codfw.wmnet" https://en.wikipedia.org/w/api.php?action=sitematrix [18:41:04] and got [18:41:11] { [18:41:12] "url": "https://wikitech.wikimedia.org", [18:41:12] "dbname": "labswiki", [18:41:12] "code": "labs", [18:41:12] "sitename": "Wikipedia", [18:41:14] "nonglobal": "" [18:41:16] }, [18:41:20] which is what I expected [18:41:23] so, good to go [18:42:02] Operations, ops-eqiad, Analytics: setup/install analytics-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (RobH) [18:42:18] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on einsteinium is OK: (C)60 le (W)70 le 72.86 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [18:42:38] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on einsteinium is OK: (C)1e+05 gt (W)1e+04 gt 6122 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [18:42:40] arlolra: awesome, thanks for checking, going live [18:43:01] !log Initiating in-place reindex for wikidatawiki (T147505) [18:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:09] T147505: [Recurring task] CirrusSearch: what is updated during re-indexing - https://phabricator.wikimedia.org/T147505 [18:45:01] !log thcipriani@deploy1001 scap failed: average error rate on 6/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/2cc7028226a539553178454fc2f14459 for details) [18:45:04] robh: Is there any way I can apply for +2 permissions on the puppet repo so I don't have to bug guillaume et al. 
for +2 on changes to profiles/roles that I maintain for my team's labs VMs? e.g. https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458907/ [18:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:18] jouncebot: now [18:45:18] For the next 0 hour(s) and 14 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180920T1800) [18:45:20] jouncebot: next [18:45:20] In 0 hour(s) and 14 minute(s): MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180920T1900) [18:45:45] bearloga: As far as I know, no one without root has +2 in operations/puppet [18:45:56] bearloga: You'll need to file an access request for thaat... because I guess you'll also need root/similar to be able to deploy it.. [18:46:01] So while you can petition for it, all past petitions have been rejected. The reasoning being that repo can take down the entire site. [18:46:04] Operations, Gadgets, MediaWiki-Cache, Performance-Team, and 2 others: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (elukey) Very interesting thing (that may be a red herring): ``` elukey@neodymium:~$ sudo cu... [18:46:38] not to mention puppet commits are effectively root-level access anyways [18:46:40] arlolra: deployment failed saw a lot of Undefined variable: wgSiteMatrixNonGlobalSites in /srv/mediawiki/wmf-config/CommonSettings.php on line 1543 [18:46:42] If you do wish to file an access request though, you'll want to file it in #sre-access-requests [18:47:03] though I would recommend you set your expectations accordingly, as I've never seen it granted to a non root and don't expect to. 
[18:47:03] (03CR) 10jenkins-bot: Set $wgSiteMatrixNonGlobalSites global for SiteMatrix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460034 (owner: 10Arlolra) [18:47:04] (as puppet commits can define commands which run as root, etc... or for that matter expose the private repo) [18:47:21] (plus all what bblack is stating!) [18:47:47] c'mon jerkins, I wanna deploy something [18:47:48] robh: thanks! yeah, I doubt I'd get it so I won't try :) [18:47:57] im not sure why you need +2 for labs vms though [18:48:04] my understanding is the point of labs vms is so you dont need it? [18:48:22] but yeah, its a really hard/impossible request to grant to non roots. [18:48:29] due to the level of access that repo grants. [18:49:01] (sorry to sound super negative) [18:49:18] thcipriani: oh! looking [18:49:19] robh: no problem! I completely understand [18:50:17] arlolra: here's a trace https://phabricator.wikimedia.org/P7575 [18:50:51] arlolra: I'm going to revert for the time being just because we're coming up on the end of the window [18:51:01] no problem [18:51:07] robh: as far as I know wmcs team doesn't have a separate puppet repo for roles that people would want to use in labs vms [18:51:20] robh, you don't *need* it as you can just have a puppetmaster and cherry-pick everything there [18:51:35] or just disable puppet and do whatever you like [18:51:41] neither of which are advisable [18:51:46] yeah, but it becomes a pain i can understand that [18:51:58] (03PS1) 10Thcipriani: Revert "Set $wgSiteMatrixNonGlobalSites global for SiteMatrix" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461706 [18:52:00] so its a soft blocker, it can be worked around but its a pita [18:52:04] thcipriani: i see the problem [18:52:06] :( [18:52:26] if there's time for a fix instead [18:52:29] in deployment-prep I just cherry-pick stuff while it's open for review [18:52:42] and normally develop new puppet patches there [18:53:15] arlolra: sure if you've got a patch [18:53:22] yup, 
pushing [18:53:39] !log thcipriani@deploy1001 Synchronized wmf-config/CommonSettings.php: Revert on canaries: SWAT: [[gerrit:460034|Set $wgSiteMatrixNonGlobalSites global for SiteMatrix]] (duration: 00m 53s) [18:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:48] !log reedy@deploy1001 Synchronized php-1.32.0-wmf.22/extensions/DisableAccount/: maintenance script (duration: 00m 52s) [18:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:09] (03PS1) 10Arlolra: List $wgSiteMatrixNonGlobalSites as global [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461707 [18:55:14] thcipriani: ^ [18:56:13] ah, ok [18:56:37] (03CR) 10Reedy: [C: 031] List $wgSiteMatrixNonGlobalSites as global [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461707 (owner: 10Arlolra) [18:56:42] need moar static analysis [18:57:31] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461707 (owner: 10Arlolra) [18:57:46] Sorry about that [18:57:56] shit happens :) [18:58:04] arlolra: Go see bd808 for a t-shirt [18:58:33] (03Abandoned) 10Thcipriani: Revert "Set $wgSiteMatrixNonGlobalSites global for SiteMatrix" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461706 (owner: 10Thcipriani) [18:58:40] :) [18:58:51] I wish [18:59:25] (03Merged) 10jenkins-bot: List $wgSiteMatrixNonGlobalSites as global [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461707 (owner: 10Arlolra) [18:59:44] 10Operations, 10Traffic: Consider adding expect-CT: header to enforce certificate transparency - https://phabricator.wikimedia.org/T193521 (10Krenair) we need to audit that all our current LE certs (issued via the old system) have in fact renewed since LE started embedding SCT, and that they have it (... 
[19:00:04] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180920T1900) [19:00:50] Reedy, maintenance script for DisableAccount? [19:01:01] Krenair: Gonna get the damned thing undeployed [19:01:06] yay [19:01:38] (03CR) 10jenkins-bot: List $wgSiteMatrixNonGlobalSites as global [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461707 (owner: 10Arlolra) [19:02:25] !log thcipriani@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:460034|Set $wgSiteMatrixNonGlobalSites global for SiteMatrix]] [[gerrit:461707|List $wgSiteMatrixNonGlobalSites as global]] (duration: 00m 52s) [19:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:33] ^ arlolra live now [19:02:49] Many thanks [19:02:55] sure, thanks for the fix :) [19:03:54] AndyRussG: your change is on mw2009, check please [19:04:23] thcipriani: ok checking! [19:06:04] !log reedy@deploy1001 Pruned MediaWiki: 1.32.0-wmf.4 [keeping static files] (duration: 01m 58s) [19:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:30] bleugh, I needed a delete [19:06:58] * Reedy waits for swat to actually be finished [19:07:11] last patch [19:08:17] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10User-Addshore: Requesting Access to view EventLogging data for gabriel-wmde / gbirke - https://phabricator.wikimedia.org/T202072 (10RobH) 05Open>03declined [19:08:20] thcipriani: looks good! [19:08:26] AndyRussG: ok, going live [19:08:34] thcipriani: okok thx much!! :) [19:09:22] 10Operations, 10Gadgets, 10MediaWiki-Cache, 10Performance-Team, and 2 others: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) Also interesting https://librenms.wikimedia.org/device/device=94/tab=port/port=8564/... 
[19:10:13] 10Operations, 10SRE-Access-Requests: nathante/groceryheist shell request for researchers, statistics-privatedata-users, analytics-privatedata-users - https://phabricator.wikimedia.org/T204790 (10RobH) I've emailed @RStallman-legalteam directly requesting nda on file confirmation. (Just echoing here so folks... [19:10:46] 10Operations, 10Discovery-Search, 10Elasticsearch, 10SRE-Access-Requests, 10Patch-For-Review: add onimisionipe to restricted group - https://phabricator.wikimedia.org/T204980 (10Gehel) I can vouch for @Mathew.onipe, but his manager is officially @EBjune. [19:11:39] thcipriani: also looks great live... thx so much! [19:11:49] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: add onimisionipe to maps-admin - https://phabricator.wikimedia.org/T204960 (10RobH) @EBjune: can we get your approval as manager for this expansion of shell access rights? [19:11:54] 10Operations, 10Discovery-Search, 10Elasticsearch, 10SRE-Access-Requests, 10Patch-For-Review: add onimisionipe to restricted group - https://phabricator.wikimedia.org/T204980 (10RobH) @EBjune: can we get your approval as manager for this expansion of shell access rights? 
[19:12:02] 10Operations, 10Discovery-Search, 10Elasticsearch, 10SRE-Access-Requests, 10Patch-For-Review: add onimisionipe to restricted group - https://phabricator.wikimedia.org/T204980 (10RobH) a:03EBjune [19:12:09] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: add onimisionipe to maps-admin - https://phabricator.wikimedia.org/T204960 (10RobH) a:03EBjune [19:12:10] AndyRussG: it's on the way now :) [19:12:26] !log thcipriani@deploy1001 Synchronized php-1.32.0-wmf.22/extensions/CentralNotice/resources/subscribing/ext.centralNotice.display.js: SWAT: [[gerrit:461485|Add performance mark for when banner is inserted]] T195840 (duration: 00m 51s) [19:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:33] T195840: Track when a CentralNotice banner was displayed to the user in NavTiming - https://phabricator.wikimedia.org/T195840 [19:12:34] ^ AndyRussG live now [19:12:39] Reedy: all clear [19:12:46] sweet [19:13:54] !log reedy@deploy1001 clean aborted: Pruned MediaWiki: 1.32.0-wmf.4 (duration: 01m 03s) [19:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:02] Hmm [19:14:06] thcipriani: fun... 
heh I did see it live, x debug extension turned off, before [19:14:14] That might've been my fault [19:14:15] I guess I was just luck to hit a server that scap had already gotten to [19:15:38] anyway yea looks all good [19:15:41] lucky [19:18:06] 10Operations, 10Cleanup, 10GitHub-Mirrors, 10OCG-General, and 7 others: Archive mediawiki/extensions/Collection/OfflineContentGenerator and all OCG-related repos - https://phabricator.wikimedia.org/T183891 (10MarcoAurelio) [19:19:51] !log reboot multatuli [19:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:05] !log reedy@deploy1001 Pruned MediaWiki: 1.32.0-wmf.14 [keeping static files] (duration: 02m 03s) [19:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:58] (03PS1) 10BBlack: multatuli: add to authdns server set [puppet] - 10https://gerrit.wikimedia.org/r/461712 [19:22:07] !log reedy@deploy1001 clean aborted: Pruned MediaWiki: 1.32.0-wmf.14 (duration: 00m 30s) [19:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:28] thcipriani: Is there a better way than entering my gerrit username/password many times? 
[19:22:52] !log Add email to account Abalg~commonswiki [19:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:25] Reedy: .netrc with a token https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Setup [19:23:39] PROBLEM - Host multatuli is DOWN: PING CRITICAL - Packet loss = 100% [19:23:50] RECOVERY - Host multatuli is UP: PING OK - Packet loss = 0%, RTA = 83.80 ms [19:26:02] cheers [19:26:06] 10Operations, 10Cleanup, 10GitHub-Mirrors, 10OCG-General, and 7 others: Archive mediawiki/extensions/Collection/OfflineContentGenerator and all OCG-related repos - https://phabricator.wikimedia.org/T183891 (10MarcoAurelio) [19:26:50] PROBLEM - Check systemd state on multatuli is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:26:59] PROBLEM - Auth DNS on multatuli is CRITICAL: CRITICAL - Plugin timed out while executing system call [19:27:45] lol [19:27:46] OSError: [Errno 13] Permission denied: '/srv/mediawiki-staging/php-1.32.0-wmf.14/.git/modules/extensions/DonationInterface/objects/f1/3405f7e1ec502a04ffabf37d426227930c0010' [19:27:46] 19:27:30 clean failed: [Errno 13] Permission denied: '/srv/mediawiki-staging/php-1.32.0-wmf.14/.git/modules/extensions/DonationInterface/objects/f1/3405f7e1ec502a04ffabf37d426227930c0010' [19:27:49] No log [19:28:13] tgr, your umask again causing issues :P [19:28:56] probably best for now to use: https://phabricator.wikimedia.org/T200690#4467351 [19:28:57] tgr@deploy1001:~$ umask [19:28:58] 0002 [19:29:11] tgr: it's from july [19:29:15] So possibly fixed [19:29:32] oh, well that looks right. do you need any chmod ? 
[19:29:46] oh, wmf.14 [19:29:46] can you rm -rf /srv/mediawiki-staging/php-1.32.0-wmf.14/.git [19:30:34] done [19:30:55] sorry about that :/ [19:31:07] :) [19:31:37] I'll search for more readonly stuff [19:32:24] RECOVERY - Check systemd state on multatuli is OK: OK - running: The system is fully operational [19:32:59] !log reedy@deploy1001 Pruned MediaWiki: 1.32.0-wmf.14 (duration: 02m 05s) [19:33:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:15] RECOVERY - Auth DNS on multatuli is OK: DNS OK: 0.093 seconds response time. www.wikipedia.org returns 208.80.154.224 [19:34:50] (03CR) 10BBlack: [C: 032] multatuli: add to authdns server set [puppet] - 10https://gerrit.wikimedia.org/r/461712 (owner: 10BBlack) [19:36:21] !log reedy@deploy1001 Pruned MediaWiki: 1.32.0-wmf.15 (duration: 03m 16s) [19:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:51] there are heaps of them, apparently, lots of different users [19:37:03] what broke, exactly? [19:37:10] trying to remove it, and purge it from servers [19:37:49] well, if that's going to be a problem, an op should look at it since a lot of those files are not from me [19:38:20] examples? :) [19:39:31] https://phabricator.wikimedia.org/P7576 [19:40:09] at a glance they all seem to be in .git but didn't really verify [19:40:20] Reedy: thank you for the cleanup! 
[19:40:54] np :) [19:40:58] Filed a few bugs for you too :P [19:41:19] eqiad is read-only :) [19:41:33] (03PS1) 10Sbisson: Duplicate wp10 configuration into articlequality [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461715 (https://phabricator.wikimedia.org/T203080) [19:41:44] Krinkle: Thanks, I noticed ;) [19:41:48] :P [19:43:14] * Krinkle updates host exclude in logstash to also add deploy1* in addition to mw1*, mwdebug1*,labweb1* [19:45:53] (03PS1) 10Reedy: Revert "Disable DisableAccount on wikis where there are no disabled users" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461720 [19:45:58] (03PS2) 10Reedy: Revert "Disable DisableAccount on wikis where there are no disabled users" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461720 [19:46:31] (03PS3) 10Reedy: Revert "Disable DisableAccount on wikis where there are no disabled users" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461720 [19:46:33] Sanity... [19:47:34] (03CR) 10Reedy: [C: 032] Revert "Disable DisableAccount on wikis where there are no disabled users" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461720 (owner: 10Reedy) [19:48:54] (03Merged) 10jenkins-bot: Revert "Disable DisableAccount on wikis where there are no disabled users" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461720 (owner: 10Reedy) [19:49:43] > PHP Notice: Undefined variable: wgSiteMatrixNonGlobalSites [19:49:46] Reedy: related? [19:50:05] Krinkle: That was a bad patch that was swatted... Should be fixed now though [19:50:10] k, thanks [19:50:13] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Temporarily re-enable DisableAccount on wikis it was previous enabled (duration: 00m 51s) [19:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:11] ah I see, was caught by canaries. Nice. 
[19:51:25] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on einsteinium is CRITICAL: 1.116e+05 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [19:51:27] I mean , not super nice, it should've been caught by the localhost check. [19:51:28] (03PS1) 10Gilles: Define Haproxy Prometheus jobs [puppet] - 10https://gerrit.wikimedia.org/r/461724 (https://phabricator.wikimedia.org/T187765) [19:51:30] But better than full prod [19:51:58] (03PS1) 10Reedy: Disable DisableAccount on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461726 (https://phabricator.wikimedia.org/T106067) [19:53:14] (03CR) 10Ladsgroup: [C: 04-1] "1- I think this is better to be split to two patches, one labs and one for prod. The labs one can get merged and deployed at any time (I c" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461715 (https://phabricator.wikimedia.org/T203080) (owner: 10Sbisson) [19:53:16] (03CR) 10Reedy: [C: 032] Disable DisableAccount on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461726 (https://phabricator.wikimedia.org/T106067) (owner: 10Reedy) [19:54:25] (03PS1) 10Elukey: Add IPv6 interface to an-master100[1,2] [puppet] - 10https://gerrit.wikimedia.org/r/461729 (https://phabricator.wikimedia.org/T201939) [19:54:42] (03Merged) 10jenkins-bot: Disable DisableAccount on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461726 (https://phabricator.wikimedia.org/T106067) (owner: 10Reedy) [19:55:34] Is there a json file in wmf-config extensions need removing from? 
[19:55:36] Something sounds familiar [19:55:53] (03PS1) 10Reedy: Remove DisableAccount from CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461731 (https://phabricator.wikimedia.org/T106067) [19:55:55] (03PS1) 10Reedy: Remove DisableAccount from InitialiseSettings.php and extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461732 (https://phabricator.wikimedia.org/T106067) [19:55:58] James_F: Look! I split it into two patches! [19:56:16] Two not three, though. ;-) [19:56:30] I'm not splitting IS and extension-list [19:56:37] extension-list is basically a noop :P [19:57:03] Reedy: I deleted the JSON file a few weeks ago when it became apparent that we hadn't actually ever been using it, except for a never-used branch of scap. Whoops. :-) [19:57:07] Fair. [19:57:11] hahah [19:57:13] Fine :) [19:57:19] I wasn't going completely crazy then [19:57:49] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Disable DisableAccount T106067 (duration: 00m 51s) [19:57:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:59] T106067: Undeploy DisableAccount extension - https://phabricator.wikimedia.org/T106067 [19:57:59] Do I undeploy it now, or wait a few days? [19:58:06] (03PS1) 10Sbisson: Labs: Duplicate wp10 configuration into articlequality [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461734 (https://phabricator.wikimedia.org/T203080) [19:58:09] hashar: 1.32.0-wmf.22 is everywhere now? [19:58:19] cscott: Has been for a few hours :) [19:58:29] i'm not sure what mediawiki train - european version vs mediawiki train - americas version means [19:58:30] http://tools.wmflabs.org/versions/ [19:58:33] Reedy: f3d26dcdb specifically [19:58:42] cscott: Basically, depends who's doing the train [19:58:55] If it's a european, it's the european slots [19:59:00] If it's an american, it's the american slots [19:59:12] Ideally we would delete the unused slots. [19:59:13] oh, ok. 
i scheduled a patch for the evening swat because i needed the train to go through first, if i'd been paying more attention i guess i could have done it earlier today [19:59:43] The window is pretty much empty if you need to get something out [19:59:48] (03CR) 10jerkins-bot: [V: 04-1] Labs: Duplicate wp10 configuration into articlequality [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461734 (https://phabricator.wikimedia.org/T203080) (owner: 10Sbisson) [20:00:23] Reedy: https://gerrit.wikimedia.org/r/c/443645/ from today's evening swat list? [20:00:46] (03CR) 10Sbisson: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461734 (https://phabricator.wikimedia.org/T203080) (owner: 10Sbisson) [20:00:55] !log elasticsearch codfw cluster restart for new systemd unit completed [20:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:07] ooh [20:01:12] (03PS7) 10Reedy: Remove $wgUseTidy and $wgTidyConfig from configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443645 (https://phabricator.wikimedia.org/T175706) (owner: 10C. Scott Ananian) [20:01:21] cscott: Well nerd-sniped. ;-) [20:01:58] turns out i've got an open house for my son during the evening swat window, and i'm too impatient to wait until monday [20:02:39] (03Abandoned) 10Reedy: Remove wgTidyConfig; same as DefaultSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444775 (owner: 10Reedy) [20:02:49] Reedy: ah! [20:02:54] snap :D [20:03:41] maybe you'll like https://gerrit.wikimedia.org/r/460202 too, though that has a dependent patch to core which needs to ride the train first (but don't let me distract you) [20:04:23] (03CR) 10Reedy: [C: 032] Remove $wgUseTidy and $wgTidyConfig from configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443645 (https://phabricator.wikimedia.org/T175706) (owner: 10C. 
Scott Ananian) [20:04:45] (03CR) 10Sbisson: "> 1- I think this is better to be split to two patches, one labs and" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461715 (https://phabricator.wikimedia.org/T203080) (owner: 10Sbisson) [20:05:48] (03Merged) 10jenkins-bot: Remove $wgUseTidy and $wgTidyConfig from configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443645 (https://phabricator.wikimedia.org/T175706) (owner: 10C. Scott Ananian) [20:06:31] (03CR) 10Ayounsi: [C: 032] SNMP: set snmp-mibs-downloader BASEDIR to Debian 9 standard [puppet] - 10https://gerrit.wikimedia.org/r/461503 (https://phabricator.wikimedia.org/T83992) (owner: 10Ayounsi) [20:06:49] (03CR) 10jenkins-bot: Revert "Disable DisableAccount on wikis where there are no disabled users" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461720 (owner: 10Reedy) [20:06:51] (03CR) 10jenkins-bot: Disable DisableAccount on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461726 (https://phabricator.wikimedia.org/T106067) (owner: 10Reedy) [20:06:53] (03CR) 10jenkins-bot: Remove $wgUseTidy and $wgTidyConfig from configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443645 (https://phabricator.wikimedia.org/T175706) (owner: 10C. Scott Ananian) [20:06:59] (03PS2) 10Ayounsi: SNMP: set snmp-mibs-downloader BASEDIR to Debian 9 standard [puppet] - 10https://gerrit.wikimedia.org/r/461503 (https://phabricator.wikimedia.org/T83992) [20:07:13] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Bye bye tidy config! 
(duration: 00m 50s) [20:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:10] cscott: Yeah, I'm not backporting that core change today ;) [20:09:18] Reedy: https://en.wikipedia.org/wiki/User:Cscott/TidyTest seems to indicate that remex is still in use, so we haven't massively broken anything AFAICT [20:10:20] Reedy: but its such a little bitty patch to backport ;) [20:10:37] I'll put it on the swat schedule for next week, same bat time same bat place [20:11:12] cscott: But https://gerrit.wikimedia.org/r/c/mediawiki/core/+/442104 is now good to merge, right? [20:11:56] yes indeed! and i think some others built on that one. [20:12:33] https://gerrit.wikimedia.org/r/442107 is already C+2'ed, it should merge with 442104 [20:14:30] 10Operations, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes on HHVM (rather than ~5 on PHP 5) - https://phabricator.wikimedia.org/T191921 (10hashar) Great! `\o/` [20:22:08] !log Started wikidata reindex again, hopefully better luck this time [20:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:42] 10Operations, 10SRE-Access-Requests: nathante/groceryheist shell request for researchers, statistics-privatedata-users, analytics-privatedata-users - https://phabricator.wikimedia.org/T204790 (10RStallman-legalteam) Confirming for legal that the contract is enough. No need to sign an additional NDA. I'll add... 
[20:25:15] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on einsteinium is OK: (C)1e+05 gt (W)1e+04 gt 9123 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [20:28:46] Cirrus search lag when consuming from the related topics --^ [20:34:05] (03PS2) 10Jforrester: Cleanup: Drop old comment about zhwiki priv changes from 2010 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450886 [20:34:07] (03PS2) 10Jforrester: Cleanup: Drop old comment for a global rollback group that doesn't exist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450887 [20:34:09] (03PS2) 10Jforrester: Cleanup: Drop old comment for a global developer group that doesn't exist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450888 [20:34:11] (03PS2) 10Jforrester: Cleanup: Drop old comments for general user access to FlaggedRevs on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450889 [20:34:13] (03PS2) 10Jforrester: Cleanup: Drop old comment for khmwikt [sic] import sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450890 [20:36:20] 10Operations, 10SRE-Access-Requests: nathante/groceryheist shell request for researchers, statistics-privatedata-users, analytics-privatedata-users - https://phabricator.wikimedia.org/T204790 (10RobH) a:05RStallman-legalteam>03None [20:36:46] (03PS2) 10Jforrester: Follow-up 0629eb9: Fix outdated reference to user group name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455624 [20:36:57] jouncebot: next [20:36:58] In 2 hour(s) and 23 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180920T2300) [20:37:06] 10Operations, 10SRE-Access-Requests: nathante/groceryheist shell request for researchers, statistics-privatedata-users, analytics-privatedata-users - https://phabricator.wikimedia.org/T204790 (10RobH) 
[20:37:15] Anyone mind if I push my comments-only clean-up patches? :-) [20:41:32] (03PS1) 10RobH: adding shell user nathante [puppet] - 10https://gerrit.wikimedia.org/r/461760 (https://phabricator.wikimedia.org/T204790) [20:43:49] (03PS1) 10RobH: adding nathante to groups [puppet] - 10https://gerrit.wikimedia.org/r/461761 (https://phabricator.wikimedia.org/T204790) [20:44:23] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: nathante/groceryheist shell request for researchers, statistics-privatedata-users, analytics-privatedata-users - https://phabricator.wikimedia.org/T204790 (10RobH) [20:44:27] James_F: Are they comments full of lies? [20:45:10] Yes, hence my clean-up. [20:47:55] (03CR) 10Jforrester: [C: 032] Cleanup: Drop old comment about zhwiki priv changes from 2010 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450886 (owner: 10Jforrester) [20:48:03] (03CR) 10Jforrester: [C: 032] Cleanup: Drop old comment for a global rollback group that doesn't exist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450887 (owner: 10Jforrester) [20:48:13] (03CR) 10Jforrester: [C: 032] Cleanup: Drop old comment for a global developer group that doesn't exist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450888 (owner: 10Jforrester) [20:48:22] (03CR) 10Jforrester: [C: 032] Cleanup: Drop old comments for general user access to FlaggedRevs on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450889 (owner: 10Jforrester) [20:48:32] (03CR) 10Jforrester: [C: 032] Cleanup: Drop old comment for khmwikt [sic] import sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450890 (owner: 10Jforrester) [20:48:43] (03CR) 10Jforrester: [C: 032] Follow-up 0629eb9: Fix outdated reference to user group name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455624 (owner: 10Jforrester) [20:49:07] (03Abandoned) 10Jforrester: [WIP] Let Wikidata editors edit at a higher rate than on other wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/280003 (owner: 10Jforrester) [20:49:45] (03Merged) 10jenkins-bot: Cleanup: Drop old comment about zhwiki priv changes from 2010 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450886 (owner: 10Jforrester) [20:49:49] (03Merged) 10jenkins-bot: Cleanup: Drop old comment for a global rollback group that doesn't exist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450887 (owner: 10Jforrester) [20:50:26] (03Merged) 10jenkins-bot: Cleanup: Drop old comment for a global developer group that doesn't exist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450888 (owner: 10Jforrester) [20:50:29] (03Merged) 10jenkins-bot: Cleanup: Drop old comments for general user access to FlaggedRevs on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450889 (owner: 10Jforrester) [20:50:39] (03Merged) 10jenkins-bot: Cleanup: Drop old comment for khmwikt [sic] import sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450890 (owner: 10Jforrester) [20:50:51] (03Merged) 10jenkins-bot: Follow-up 0629eb9: Fix outdated reference to user group name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455624 (owner: 10Jforrester) [20:51:30] (03CR) 10Ayounsi: "This actually didn't solve the issue, digging more, /etc/smi.conf in Debian 8 defines different paths than the same file in Debian 9. 
Path" [puppet] - 10https://gerrit.wikimedia.org/r/461503 (https://phabricator.wikimedia.org/T83992) (owner: 10Ayounsi) [20:54:04] (03CR) 10jenkins-bot: Cleanup: Drop old comment about zhwiki priv changes from 2010 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450886 (owner: 10Jforrester) [20:54:06] (03CR) 10jenkins-bot: Cleanup: Drop old comment for a global rollback group that doesn't exist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450887 (owner: 10Jforrester) [20:54:08] (03CR) 10jenkins-bot: Cleanup: Drop old comment for a global developer group that doesn't exist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450888 (owner: 10Jforrester) [20:54:10] (03CR) 10jenkins-bot: Cleanup: Drop old comments for general user access to FlaggedRevs on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450889 (owner: 10Jforrester) [20:54:12] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Comment clean-up no-op (duration: 00m 51s) [20:54:12] (03CR) 10jenkins-bot: Cleanup: Drop old comment for khmwikt [sic] import sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450890 (owner: 10Jforrester) [20:54:14] (03CR) 10jenkins-bot: Follow-up 0629eb9: Fix outdated reference to user group name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455624 (owner: 10Jforrester) [20:54:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:17] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Comment clean-up no-op (duration: 00m 50s) [20:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:44] All done. 
[20:57:33] 10Operations, 10ops-eqiad, 10Analytics: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (10RobH) [20:59:07] 10Operations, 10ops-eqiad: apply hostname labels to an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T205034 (10RobH) p:05Triage>03Low [21:07:13] 10Operations, 10ops-eqiad, 10Analytics: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (10RobH) ``` robh@asw2-b-eqiad# show | compare [edit interfaces interface-range disabled] - member ge-1/0/7; [edit interfaces interface-range vlan-private1-b-eqiad] - member ge-1/0/7;... [21:09:29] 10Operations, 10ops-eqiad, 10Analytics: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (10RobH) [21:13:16] (03PS1) 10RobH: an-coord1001 dns updates [dns] - 10https://gerrit.wikimedia.org/r/461771 (https://phabricator.wikimedia.org/T204970) [21:15:23] (03CR) 10RobH: [C: 032] an-coord1001 dns updates [dns] - 10https://gerrit.wikimedia.org/r/461771 (https://phabricator.wikimedia.org/T204970) (owner: 10RobH) [21:16:33] !log 1.32.0-wmf.22 is fully deployed. A quick summary and thanks words are at https://phabricator.wikimedia.org/T191068#4604040 [21:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:27] !log releases2001 - manually rsync releases files from releases1001 [21:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:10] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (10RobH) a:05RobH>03Cmjohnson @Cmjohnson: It appears this system had dns added for mgmt, and is racked, but its mgmt is not online. I've attempted to ping it dire... 
[21:26:14] !log releases1001 - deleting cronjob remnants for releases from bromine [21:26:14] !log releases1001 - rm /usr/local/sbin/sync-srv-org-wikimedia-releases (bromine remnants) [21:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:52] (03PS1) 10Gilles: Upgrade to 2.2 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/461793 (https://phabricator.wikimedia.org/T20871) [21:36:34] 10Operations: releases servers: set rsync direction based on active dc, add warning motd on inactive server - https://phabricator.wikimedia.org/T205037 (10Dzahn) [21:37:35] 10Operations: releases servers: set rsync direction based on active dc, add warning motd on inactive server - https://phabricator.wikimedia.org/T205037 (10Dzahn) [21:44:01] 10Operations: releases servers: set rsync direction based on active dc, add warning motd on inactive server - https://phabricator.wikimedia.org/T205037 (10Dzahn) But i still need the actual host names here to pass to rsync::quickdatacopy as source_host and dest_host. It's not enough to just have an $enabled true... [21:46:08] 10Operations: releases servers: set rsync direction based on active dc, add warning motd on inactive server - https://phabricator.wikimedia.org/T205037 (10Dzahn) We can say "if in primary_dc then source_host is $fqdn" but still need a dest host. How to determine it, just replacing "1" with "2" in the name is al... 
[21:55:01] Operations, Community-Tech, MediaWiki-Parser, Traffic: Show SVGs in in wiki language if available - https://phabricator.wikimedia.org/T205040 (MaxSem)
[21:58:50] (PS1) Dzahn: make more obvious which rsyncs are automatic [puppet] - https://gerrit.wikimedia.org/r/461809 (https://phabricator.wikimedia.org/T205037)
[22:05:42] (CR) Dzahn: [C: 2] make more obvious which rsyncs are automatic [puppet] - https://gerrit.wikimedia.org/r/461809 (https://phabricator.wikimedia.org/T205037) (owner: Dzahn)
[22:10:16] James_F: Do we undeploy DisableAccount today? Or wait a bit? :P
[22:10:52] Reedy: Wait for complaints over the weekend, then de-deploy on Monday?
[22:11:04] Sounds good to me
[22:11:14] If we can stop it being in the next train (Y)
[22:11:46] Oh, hmm.
[22:12:05] Yeah, need to either kill it before the train goes or do some fancy l10n footwork.
[22:12:44] No train on monday, so should be good
[22:13:22] Operations, ops-eqiad, Patch-For-Review: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (Dzahn) role applied, now blocked on puppet error due to missing cert for mcrouter in the private repo. How to generate these certs isn't documented yet. working on changing...
[22:14:55] PROBLEM - Disk space on elastic1019 is CRITICAL: DISK CRITICAL - free space: /srv 25713 MB (5% inode=99%)
[22:18:15] RECOVERY - Disk space on elastic1019 is OK: DISK OK
[22:19:04] PROBLEM - Host ms-be2030 is DOWN: PING CRITICAL - Packet loss = 100%
[22:29:41] (PS1) Dzahn: releases: add warning motd on secondary server [puppet] - https://gerrit.wikimedia.org/r/461810 (https://phabricator.wikimedia.org/T205037)
[22:30:17] (CR) jerkins-bot: [V: -1] releases: add warning motd on secondary server [puppet] - https://gerrit.wikimedia.org/r/461810 (https://phabricator.wikimedia.org/T205037) (owner: Dzahn)
[22:36:45] Operations, Patch-For-Review: releases servers: set rsync direction based on active dc, add warning motd on inactive server - https://phabricator.wikimedia.org/T205037 (Dzahn) eh.. class profile::releases::reprepro also included in the same role along the other profiles, already has code for this too.. b...
[22:41:54] RECOVERY - Host ms-be2030 is UP: PING OK - Packet loss = 0%, RTA = 38.86 ms
[22:47:33] Operations, ops-codfw: ms-be2030 spontaneous reboot - https://phabricator.wikimedia.org/T204567 (Papaul) The server reset again. This time I have something in the log file.
see below {F26118220}
[22:48:45] (PS2) Dzahn: releases: add warning motd on secondary server [puppet] - https://gerrit.wikimedia.org/r/461810 (https://phabricator.wikimedia.org/T205037)
[22:56:23] (CR) Dzahn: [C: 2] "https://puppet-compiler.wmflabs.org/compiler1002/12528/releases1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/461810 (https://phabricator.wikimedia.org/T205037) (owner: Dzahn)
[23:02:34] (PS1) Dzahn: releases: fix template name for motd warning [puppet] - https://gerrit.wikimedia.org/r/461816 (https://phabricator.wikimedia.org/T205037)
[23:10:51] (CR) Dzahn: [C: 2] releases: fix template name for motd warning [puppet] - https://gerrit.wikimedia.org/r/461816 (https://phabricator.wikimedia.org/T205037) (owner: Dzahn)
[23:31:04] Operations, ops-codfw: ms-be2030 spontaneous reboot - https://phabricator.wikimedia.org/T204567 (Papaul) Called HP at 17:48 CDT call ended at 18:25 CDT. went over again the new Log from today and didn't see any internal problem on the server. The only thing he asked me to do it to change the power settin...
[23:34:37] Operations, DBA, Wikidata, Wikidata-Query-Service, and 4 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (Smalyshev) p:High>Normal
[23:37:39] Puppet, Cloud-VPS: cloudvps: puppet project trusty deprecation - https://phabricator.wikimedia.org/T204558 (Dzahn) @aborrero The trusty instances above can be removed (if needed) @Krenair Is this resolved after shutdown or only after actual instance deletion?
[23:39:55] Puppet, Cloud-VPS: cloudvps: puppet project trusty deprecation - https://phabricator.wikimedia.org/T204558 (Krenair) Only after instance deletion. Shutting an instance down is reversible.
[23:53:37] Operations, DBA, JADE, Patch-For-Review, and 2 others: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (awight) Thanks for all the attention given to this, and apologies for thinking that the namespace condition would behave the same in...