[00:00:04] if mediawiki took too long theoretically varnish would complain [00:00:39] so yeah wonder what that was [00:00:50] kind of difficult to debug now it's come back on it's own :/ [00:04:09] on the wikitech-static topic: i found it, it's "wikitech-static OK - wikitech and wikitech-static in sync (64257 < 200000s)" [00:04:17] and yes to what you said about beta [00:05:06] !log reset email for User:Galahad [00:05:07] that check about wt-static is on 3 hosts, labweb1001,labweb1002 and labtestweb2001 [00:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:08:18] mutante, woah hang on [00:08:28] something's changed [00:08:35] $ host wikitech.wikimedia.org [00:08:35] wikitech.wikimedia.org has address 91.198.174.192 [00:08:35] wikitech.wikimedia.org has IPv6 address 2620:0:862:ed1a::1 [00:08:39] that's text-lb [00:08:43] ! [00:08:55] what happened to that being directly exposed? [00:08:59] with labweb1001 and 1002 as backends?:) [00:09:29] wikitech.wikimedia.org: [00:09:29] director: 'labweb' [00:09:41] :) eqiad: 'labweb.svc.eqiad.wmnet' [00:09:43] I wonder if this means it'll get the headers for mobile stuff now [00:10:08] good question / probably [00:11:15] PROBLEM - Disk space on elastic1019 is CRITICAL: DISK CRITICAL - free space: /srv 48877 MB (10% inode=99%) [00:16:22] mutante, hmm I have a feeling it isn't [00:17:50] ^ 32G left and running and more free space coming back (CRIT is set to alert at 10%) [00:18:40] we'll see a recover soon [00:18:45] RECOVERY - Disk space on elastic1019 is OK: DISK OK [00:18:47] might want to chat to bblack about varnish and wikitech at some point [00:19:23] yes, that's good [00:21:28] indeed it's nice it's not directly exposed anymore,when you said "not in cluster" that's why i immediately said it makes sense.. 
i was still thinking how it used to be [00:24:55] (03PS14) 10Alex Monk: [WIP] Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [00:25:58] (03PS15) 10Alex Monk: [WIP] Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [00:27:02] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [00:32:52] (03PS1) 10Dzahn: tor::relay: add configurable thirdparty APT source [puppet] - 10https://gerrit.wikimedia.org/r/456056 (https://phabricator.wikimedia.org/T196701) [00:33:50] (03CR) 10jerkins-bot: [V: 04-1] tor::relay: add configurable thirdparty APT source [puppet] - 10https://gerrit.wikimedia.org/r/456056 (https://phabricator.wikimedia.org/T196701) (owner: 10Dzahn) [00:35:30] (03CR) 10Reedy: "I don't think this needs it's own template.." [puppet] - 10https://gerrit.wikimedia.org/r/455369 (https://phabricator.wikimedia.org/T202819) (owner: 10Reedy) [00:41:27] (03CR) 10Bartosz Dziewoński: [C: 031] Set category collation to 'uca-et-u-kn' on Estonian-language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455804 (https://phabricator.wikimedia.org/T202977) (owner: 10Gerrit Patch Uploader) [00:58:50] (03CR) 10Dzahn: "yea, $other_wikis seems like a good idea .. 
the way this is configured is in the middle of a restructuring" [puppet] - 10https://gerrit.wikimedia.org/r/455369 (https://phabricator.wikimedia.org/T202819) (owner: 10Reedy) [00:59:51] (03PS3) 10Reedy: Add fixcopyright.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/455369 (https://phabricator.wikimedia.org/T202819) [01:02:49] (03PS2) 10Dzahn: tor::relay: add configurable thirdparty APT source [puppet] - 10https://gerrit.wikimedia.org/r/456056 (https://phabricator.wikimedia.org/T196701) [01:03:18] (03PS2) 10Catrope: Revert "Create copyviobot group in beta labs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455741 [01:03:24] (03CR) 10Catrope: [C: 032] Revert "Create copyviobot group in beta labs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455741 (owner: 10Catrope) [01:03:31] (03CR) 10jerkins-bot: [V: 04-1] tor::relay: add configurable thirdparty APT source [puppet] - 10https://gerrit.wikimedia.org/r/456056 (https://phabricator.wikimedia.org/T196701) (owner: 10Dzahn) [01:04:41] (03Merged) 10jenkins-bot: Revert "Create copyviobot group in beta labs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455741 (owner: 10Catrope) [01:06:20] (03PS3) 10Dzahn: tor::relay: add configurable thirdparty APT source [puppet] - 10https://gerrit.wikimedia.org/r/456056 (https://phabricator.wikimedia.org/T196701) [01:07:19] (03CR) 10jerkins-bot: [V: 04-1] tor::relay: add configurable thirdparty APT source [puppet] - 10https://gerrit.wikimedia.org/r/456056 (https://phabricator.wikimedia.org/T196701) (owner: 10Dzahn) [01:09:33] (03CR) 10jenkins-bot: Revert "Create copyviobot group in beta labs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455741 (owner: 10Catrope) [01:11:57] (03PS4) 10Dzahn: tor::relay: add configurable thirdparty APT source [puppet] - 10https://gerrit.wikimedia.org/r/456056 (https://phabricator.wikimedia.org/T196701) [01:14:56] (03CR) 10Dzahn: [C: 031] "re-compiled what Luca compiled earlier now with PS2. 
looks good:" [puppet] - 10https://gerrit.wikimedia.org/r/455369 (https://phabricator.wikimedia.org/T202819) (owner: 10Reedy) [01:21:26] Krenair: one more thing before going afk.. take a look at https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/402758/ [01:21:32] i rebased that manually [01:21:47] note how it only changes 1 file now and not 2 though [01:22:05] mount_filesystem.pp is already different meanwhile [01:24:12] mutante, the modules/swift/manifests/init_device.pp change is looking bad on there now [01:24:19] why is \/dev\/ being removed? [01:25:12] Krenair: ugh.. did i mess up the rebase? it's possible [01:25:21] point was to get it done first before the former parent [01:25:42] yeah [01:25:49] but there was a lot of change in between [01:25:53] since those were created i guess [01:25:57] I think the /dev removal only makes sense with the parent [01:27:20] ok. well my intention was to get 1/2 merged rather than none and that had a positive response already [01:27:52] yes [01:28:02] good idea [01:28:22] it needed manual rebase one way or another i think [01:28:48] gotta go afk for now, we'll get back to it [01:29:02] cool, cya [01:29:52] (03CR) 10Alex Monk: [C: 04-1] "Just need to re-add the /dev stuff as it doesn't make sense without the parent" [puppet] - 10https://gerrit.wikimedia.org/r/402758 (https://phabricator.wikimedia.org/T184236) (owner: 10Alex Monk) [01:48:09] (03PS16) 10Alex Monk: [WIP] Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [01:49:19] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [01:51:35] RECOVERY - MariaDB Slave Lag: s2 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 7.46 seconds [01:56:23] (03CR) 10Alex Monk: "need to figure out why the python3-acme pin was necessary, for some reason by default 
http://deb.debian.org/debian has 500 and http://mirr" [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [02:28:38] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.18) (duration: 08m 01s) [02:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:37:43] 10Operations, 10Core-Platform-Team, 10HHVM, 10TechCom-RFC (TechCom-Approved), 10User-ArielGlenn: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Krinkle) [03:02:59] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.19) (duration: 16m 23s) [03:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:13:13] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Wed Aug 29 03:13:13 UTC 2018 (duration 10m 14s) [03:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:29:15] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 928.29 seconds [03:56:16] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 256.79 seconds [04:00:04] kart_: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for ContentTranslation old draft purge script run (T201895) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180829T0400). [04:00:04] kart_: A patch you scheduled for ContentTranslation old draft purge script run (T201895) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[04:00:04] T201895: Third manual run of unpublished draft purge script - https://phabricator.wikimedia.org/T201895 [04:02:12] yeah [04:28:53] !log Finished old draft purge for ContentTranslation script run on mwmaint1001 (T201895) [04:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:28:59] T201895: Third manual run of unpublished draft purge script - https://phabricator.wikimedia.org/T201895 [04:38:48] (03PS1) 10Alex Monk: certcentral_api: basic functionality fixes and error log [software/certcentral] - 10https://gerrit.wikimedia.org/r/456067 [04:40:13] (03CR) 10Alex Monk: "the last two are probably unideal" [software/certcentral] - 10https://gerrit.wikimedia.org/r/456067 (owner: 10Alex Monk) [04:40:16] (03CR) 10jerkins-bot: [V: 04-1] certcentral_api: basic functionality fixes and error log [software/certcentral] - 10https://gerrit.wikimedia.org/r/456067 (owner: 10Alex Monk) [05:00:26] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [05:00:55] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [05:02:15] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [05:02:36] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [05:02:56] PROBLEM - HTTP availability for Nginx -SSL terminators- 
at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [05:02:56] PROBLEM - HTTP availability for Varnish at eqiad on einsteinium is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [05:04:45] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [05:04:46] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [05:08:49] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456070 [05:10:16] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456070 (owner: 10Marostegui) [05:11:35] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456070 (owner: 10Marostegui) [05:14:15] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0 [05:14:26] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0 [05:15:02] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1097:3314 (duration: 01m 07s) [05:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:19:06] RECOVERY - HTTP availability for Varnish at eqiad on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [05:21:15] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [05:21:26] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [05:23:08] (03PS1) 10Marostegui: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456071 [05:24:34] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456071 (owner: 10Marostegui) [05:24:49] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456070 (owner: 10Marostegui) [05:25:50] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456071 (owner: 10Marostegui) [05:27:02] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1084 (duration: 00m 54s) [05:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:27:36] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2088 rebooted itself and came back sick - https://phabricator.wikimedia.org/T202822 (10Marostegui) s1 finished checking - all good. Going to repool this host now. 
[05:28:45] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [05:29:58] (03PS1) 10Marostegui: db2088.yaml: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/456072 (https://phabricator.wikimedia.org/T202822) [05:33:33] (03CR) 10Marostegui: [C: 032] db2088.yaml: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/456072 (https://phabricator.wikimedia.org/T202822) (owner: 10Marostegui) [05:37:07] 08Warning Alert for device cr2-esams.wikimedia.org - Inbound interface errors [05:40:41] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456071 (owner: 10Marostegui) [05:42:18] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456074 [05:42:23] (03PS2) 10Marostegui: Revert "db-codfw.php: Depool db2088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456074 [05:44:10] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456074 (owner: 10Marostegui) [05:45:22] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456074 (owner: 10Marostegui) [05:46:36] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2088 - T202822 (duration: 00m 55s) [05:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:41] T202822: db2088 rebooted itself and came back sick - https://phabricator.wikimedia.org/T202822 [05:46:50] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2088 rebooted itself and came back sick - https://phabricator.wikimedia.org/T202822 (10Marostegui) 05Open>03Resolved Server repooled [05:49:55] RECOVERY - Router 
interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 [05:50:05] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 [05:51:06] 10Operations, 10ops-codfw, 10DBA: db2042 RAID battery failed - https://phabricator.wikimedia.org/T202051 (10Marostegui) db2042 keeps recharging. Even if the BBU fails eventually, I will leave it as it is and not replace the BBU (we only have 1, which is the one from db2064). The reason I wouldn't use the BBU... [05:55:26] !log Deploy schema change on db1073 (labswiki) - T114117 T51191 T67448 [05:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:33] T114117: Drop externallinks.el_from_namespace on wmf databases - https://phabricator.wikimedia.org/T114117 [05:55:34] T51191: Dropping rc_moved_to_title/rc_moved_to_ns on wmf databases - https://phabricator.wikimedia.org/T51191 [05:55:34] T67448: Dropping rc_cur_time on wmf databases - https://phabricator.wikimedia.org/T67448 [05:56:10] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456074 (owner: 10Marostegui) [05:58:12] !log Deploy schema change on db1073 (labswiki) - T187089 [05:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:58:19] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [06:02:06] 08̶W̶a̶r̶n̶i̶n̶g Device cr2-esams.wikimedia.org recovered from Inbound interface errors [06:02:23] !log Deploy schema change on db1083 (labswiki) - T89737 [06:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:29] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737 [06:03:13] !log migrate archiva.wikimedia.org to archiva1001 (upgrading archiva to 
its latest upstream version + Debian Stretch + Java 8) [06:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:26] this will likely cause a brief archiva downtime [06:03:31] let me know if it impacts you [06:04:04] (03CR) 10Elukey: [C: 032] Switch archiva.wikimedia.org to archiva1001 [dns] - 10https://gerrit.wikimedia.org/r/455760 (https://phabricator.wikimedia.org/T192639) (owner: 10Elukey) [06:04:07] (03PS4) 10Elukey: Switch archiva.wikimedia.org to archiva1001 [dns] - 10https://gerrit.wikimedia.org/r/455760 (https://phabricator.wikimedia.org/T192639) [06:04:35] (03PS3) 10Elukey: Move archiva.wikimedia.org from meitnerium to archiva1001 [puppet] - 10https://gerrit.wikimedia.org/r/455761 (https://phabricator.wikimedia.org/T192639) [06:08:39] (03CR) 10Elukey: [C: 032] Move archiva.wikimedia.org from meitnerium to archiva1001 [puppet] - 10https://gerrit.wikimedia.org/r/455761 (https://phabricator.wikimedia.org/T192639) (owner: 10Elukey) [06:12:59] done! [06:15:48] (03CR) 10Elukey: [C: 031] Remove now unused turnilo, superset, hue, yarn puppetization from thorium [puppet] - 10https://gerrit.wikimedia.org/r/455864 (https://phabricator.wikimedia.org/T202011) (owner: 10Ottomata) [06:29:20] PROBLEM - puppet last run on labvirt1017 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml] [06:29:23] (03CR) 10Gehel: [C: 031] "No idea if it will have an impact on Cassandra, but "-XX:+UseNUMA" had quite an impact on elastic. 
Without it, we had most of the JVM memo" [puppet] - 10https://gerrit.wikimedia.org/r/426152 (https://phabricator.wikimedia.org/T192112) (owner: 10Eevans) [06:29:33] (03CR) 10Gehel: [C: 031] cassandra: restore (most) G1GC settings to defaults [puppet] - 10https://gerrit.wikimedia.org/r/426152 (https://phabricator.wikimedia.org/T192112) (owner: 10Eevans) [06:32:06] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssh/userkeys/root.d/labstore] [06:53:37] (03CR) 10Vgutierrez: [C: 04-1] certcentral_api: basic functionality fixes and error log (032 comments) [software/certcentral] - 10https://gerrit.wikimedia.org/r/456067 (owner: 10Alex Monk) [06:57:07] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:20] (03PS2) 10Gehel: Add esperanto and stop using gnupg.net as the key server [software/elasticsearch/plugins] (5.x) - 10https://gerrit.wikimedia.org/r/455854 (owner: 10DCausse) [06:58:02] (03CR) 10Gehel: Add esperanto and stop using gnupg.net as the key server (031 comment) [software/elasticsearch/plugins] (5.x) - 10https://gerrit.wikimedia.org/r/455854 (owner: 10DCausse) [06:59:27] RECOVERY - puppet last run on labvirt1017 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [06:59:27] 10Operations, 10SRE-Access-Requests: Please add everyone on the performance team to perf-roots - https://phabricator.wikimedia.org/T202648 (10MoritzMuehlenhoff) [06:59:33] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Please add aaron to perf-team - https://phabricator.wikimedia.org/T202650 (10MoritzMuehlenhoff) 05Resolved>03Open I don't understand this task. Aaron already had global root already, why is that needed at all? 
[07:00:16] (03PS4) 10Gilles: Serve WebP variants for the hottest thumbnails [puppet] - 10https://gerrit.wikimedia.org/r/434055 (https://phabricator.wikimedia.org/T27611) [07:00:59] (03CR) 10Gilles: Serve WebP variants for the hottest thumbnails (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/434055 (https://phabricator.wikimedia.org/T27611) (owner: 10Gilles) [07:01:01] (03PS5) 10Gilles: Serve WebP variants for the hottest thumbnails [puppet] - 10https://gerrit.wikimedia.org/r/434055 (https://phabricator.wikimedia.org/T27611) [07:02:25] 10Operations, 10ops-eqiad, 10Discovery, 10Wikidata, and 2 others: add SSDs to wdqs100[45] - https://phabricator.wikimedia.org/T202779 (10Gehel) @Cmjohnson wdqs1004 is back into rotation, ping me when you have time for the next one (we also have T202780) [07:05:16] (03CR) 10Gehel: [C: 032] Enable daily category diffs for internal [puppet] - 10https://gerrit.wikimedia.org/r/455766 (owner: 10Smalyshev) [07:06:05] (03PS2) 10Gehel: Enable daily category diffs for internal [puppet] - 10https://gerrit.wikimedia.org/r/455766 (owner: 10Smalyshev) [07:11:02] (03CR) 10Muehlenhoff: [C: 04-1] tor::relay: add configurable thirdparty APT source (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/456056 (https://phabricator.wikimedia.org/T196701) (owner: 10Dzahn) [07:28:49] 10Operations, 10wikidiff2, 10Patch-For-Review: Create releasers-wikidiff2 group, split from releasers-mediawiki - https://phabricator.wikimedia.org/T202473 (10WMDE-Fisch) [07:28:54] 10Operations, 10SRE-Access-Requests, 10wikidiff2, 10Patch-For-Review, 10User-Addshore: Give WMDE-Fisch permission to upload wikidiff2 releases (releasers-wikidiff2) - https://phabricator.wikimedia.org/T202475 (10WMDE-Fisch) 05Open>03Resolved a:03WMDE-Fisch >>! In T202475#4539129, @Legoktm wrote: >... 
[07:31:25] !log rebooting cp1008/pinkunicorn [07:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:04] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Watching / External): Add contint-roots to releases{1,2}001 - https://phabricator.wikimedia.org/T201470 (10ArielGlenn) I've talked about this a little with moritzm and we've decided to go back to the SRE meeting with it, since the solution prop... [07:36:53] 10Operations, 10ops-codfw, 10DBA: db2042 RAID battery failed - https://phabricator.wikimedia.org/T202051 (10jcrespo) 05Resolved>03Open We can reboot it again- it worked last time- at least as a short term measure. [07:37:33] 10Operations, 10ops-codfw, 10DBA: db2042 RAID battery failed - https://phabricator.wikimedia.org/T202051 (10Marostegui) It is still recharging, it has not failed yet [07:39:30] 10Operations, 10Dumps-Generation: Reboots of dumps/snapshot hosts for L1TF/microcode updates - https://phabricator.wikimedia.org/T202623 (10ArielGlenn) [07:39:33] (03CR) 10Volans: [C: 032] mediawiki: add siteinfo-related methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/455851 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [07:40:39] (03Merged) 10jenkins-bot: mediawiki: add siteinfo-related methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/455851 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [07:46:47] 10Operations, 10ops-codfw, 10DBA: db2042 RAID battery failed - https://phabricator.wikimedia.org/T202051 (10Marostegui) This is the HW log from the first time the battery failed (16th Aug) ``` description=POST Error: 1705-Slot X Drive Array - Please replace Cache Module Super-Cap. Caching will be enabled... 
[07:47:50] !log Reboot db2042 - T202051 [07:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:56] T202051: db2042 RAID battery failed - https://phabricator.wikimedia.org/T202051 [07:57:00] (03CR) 10DCausse: [C: 031] Add esperanto and stop using gnupg.net as the key server [software/elasticsearch/plugins] (5.x) - 10https://gerrit.wikimedia.org/r/455854 (owner: 10DCausse) [07:58:08] 10Operations, 10ops-codfw, 10DBA: db2042 RAID battery failed - https://phabricator.wikimedia.org/T202051 (10Marostegui) After the reboot it has finally marked itself as failed: ``` date=08/29/2018 time=07:54 description=POST Error: 1705-Slot X Drive Array - Please replace Cache Module Super-Cap.... [08:04:48] 10Operations, 10ops-codfw, 10DBA: db2042 RAID battery failed - https://phabricator.wikimedia.org/T202051 (10Marostegui) [08:05:26] (03PS2) 10Addshore: InterwikiSortOrders.php doc / comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455765 (https://phabricator.wikimedia.org/T170745) [08:05:38] (03CR) 10Addshore: [C: 032] InterwikiSortOrders.php doc / comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455765 (https://phabricator.wikimedia.org/T170745) (owner: 10Addshore) [08:07:11] (03Merged) 10jenkins-bot: InterwikiSortOrders.php doc / comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455765 (https://phabricator.wikimedia.org/T170745) (owner: 10Addshore) [08:07:18] !log rolling reboot of wtp1* servers for kernel security update [08:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:28] (03PS1) 10Volans: Upstream release v0.0.2 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/456087 [08:10:20] !log addshore@deploy1001 Synchronized wmf-config/InterwikiSortOrders.php: DOCS ONLY: T170745 [[gerrit:455765|InterwikiSortOrders.php doc / comment]] (duration: 00m 57s) [08:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:25] 
T170745: Document InterwikiSortOrders.php - https://phabricator.wikimedia.org/T170745 [08:13:59] !log Force WriteBack policy on db2042 T202051 [08:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:04] T202051: db2042 RAID battery failed - https://phabricator.wikimedia.org/T202051 [08:17:02] log Get policy back to WriteThrough on db2042 T202051 [08:17:34] 10Operations, 10ops-codfw, 10DBA: db2042 RAID battery failed - https://phabricator.wikimedia.org/T202051 (10Marostegui) I have forced the BBU to be WriteBack to let the server catch up: ``` root@db2042:~# hpssacli controller all show detail | grep "Drive Write Cache" Drive Write Cache: Disabled root@db... [08:17:48] (03CR) 10Volans: [C: 032] Upstream release v0.0.2 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/456087 (owner: 10Volans) [08:18:39] (03CR) 10Gehel: [C: 032] "checksums verified, all looks good" [software/elasticsearch/plugins] (5.x) - 10https://gerrit.wikimedia.org/r/455854 (owner: 10DCausse) [08:18:46] (03CR) 10Gehel: [V: 032 C: 032] Add esperanto and stop using gnupg.net as the key server [software/elasticsearch/plugins] (5.x) - 10https://gerrit.wikimedia.org/r/455854 (owner: 10DCausse) [08:18:52] (03Merged) 10jenkins-bot: Upstream release v0.0.2 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/456087 (owner: 10Volans) [08:21:02] (03CR) 10jenkins-bot: InterwikiSortOrders.php doc / comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455765 (https://phabricator.wikimedia.org/T170745) (owner: 10Addshore) [08:23:34] (03CR) 10Gehel: [C: 031] Upgrade to 6.3.1-alpha1 (without hebrew) (031 comment) [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/446869 (https://phabricator.wikimedia.org/T199791) (owner: 10DCausse) [08:23:44] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Please add aaron to perf-team - https://phabricator.wikimedia.org/T202650 (10ArielGlenn) As Moritz points out, we have a cron 
job that explicitly checks for duplicate permissions (modules/openldap/files/cross-validate-accounts.py) and it flagged this.... [08:25:27] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on scb1002 is CRITICAL: 10 ge 4 Muehlenhoff Memory has been swapped, Icinga check will recover after four days. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=scb1002&var-datasource=eqiad%2520prometheus%252Fops [08:31:07] !log repair sdd1 on ms-be2040 - T199198 [08:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:12] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [08:31:22] !log repair sdh1 on ms-be2043 - T199198 [08:31:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:15] (03PS1) 10Elukey: Assign role::spare::system to meitnerium [puppet] - 10https://gerrit.wikimedia.org/r/456090 (https://phabricator.wikimedia.org/T192639) [08:39:48] (03PS2) 10Elukey: Assign role::spare::system to meitnerium [puppet] - 10https://gerrit.wikimedia.org/r/456090 (https://phabricator.wikimedia.org/T192639) [08:40:51] (03CR) 10Filippo Giunchedi: [C: 031] [logstash] log all elastic queries [puppet] - 10https://gerrit.wikimedia.org/r/392603 (https://phabricator.wikimedia.org/T180051) (owner: 10DCausse) [08:41:37] PROBLEM - Ubuntu mirror in sync with upstream on sodium is CRITICAL: /srv/mirrors/ubuntu is over 14 hours old. 
[08:42:02] !log uploaded cumin_3.0.2-2+deb9u1 to apt.wikimedia.org stretch-wikimedia - T177385 [08:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:06] T177385: Upgrade Cumin masters to stretch - https://phabricator.wikimedia.org/T177385 [08:42:40] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work): Onboarding Mathew Onipe - https://phabricator.wikimedia.org/T202708 (10ArielGlenn) [08:49:33] !log jmm@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=wtp1043.eqiad.wmnet [08:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:26] !log joal@deploy1001 Started deploy [analytics/refinery@1c6423f]: Fix over yesterday weekly deploy of analytics Hadoop jobs [08:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:45] !log uploaded python{,3}-conftool_1.0.2-1{,+deb9u1} to apt.wikimedia.org {jessie,stretch}-wikimedia - T177385 [08:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:49] T177385: Upgrade Cumin masters to stretch - https://phabricator.wikimedia.org/T177385 [09:00:34] (03PS1) 10Elukey: memcached: add the possibility to configure -v* parameters [puppet] - 10https://gerrit.wikimedia.org/r/456096 [09:01:10] (03PS2) 10Filippo Giunchedi: Switch statsd/carbon to graphite1004 [dns] - 10https://gerrit.wikimedia.org/r/455808 (https://phabricator.wikimedia.org/T196484) [09:02:27] (03CR) 10Filippo Giunchedi: [C: 032] Switch statsd/carbon to graphite1004 [dns] - 10https://gerrit.wikimedia.org/r/455808 (https://phabricator.wikimedia.org/T196484) (owner: 10Filippo Giunchedi) [09:03:50] !log uploaded spicerack_0.0.2-1{,+deb9u1} to apt.wikimedia.org {jessie,stretch}-wikimedia - T177385 [09:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:55] T177385: Upgrade Cumin masters to stretch - https://phabricator.wikimedia.org/T177385 [09:04:43] !log switch 
statsd and carbon CNAMEs to graphite1004 - T196484 [09:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:48] T196484: rack/setup/install graphite1004 - https://phabricator.wikimedia.org/T196484 [09:05:38] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting Access to view EventLogging data for gabriel-wmde / gbirke - https://phabricator.wikimedia.org/T202072 (10ArielGlenn) @gabriel-wmde Are you working with someone in analytics on this, could their manager sign off? [09:05:49] !log joal@deploy1001 Finished deploy [analytics/refinery@1c6423f]: Fix over yesterday weekly deploy of analytics Hadoop jobs (duration: 11m 23s) [09:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:10] (03PS2) 10Elukey: memcached: add the possibility to configure -v* parameters [puppet] - 10https://gerrit.wikimedia.org/r/456096 [09:06:55] (03PS2) 10Marostegui: filtered_tables: Remove unused columns [puppet] - 10https://gerrit.wikimedia.org/r/450934 (https://phabricator.wikimedia.org/T51191) [09:08:41] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456097 [09:09:09] !log upgraded python{,3}-conftool,spicerack on sarin [09:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:23] PROBLEM - carbon-frontend-relay metric drops on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [100.0] https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen [09:11:15] that's me ^ [09:12:04] (03PS7) 10Vgutierrez: [WIP] Validate challenges before pushing them to the ACME directory [software/certcentral] - 10https://gerrit.wikimedia.org/r/455159 (https://phabricator.wikimedia.org/T199711) [09:13:27] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Validate challenges before pushing them to 
the ACME directory [software/certcentral] - 10https://gerrit.wikimedia.org/r/455159 (https://phabricator.wikimedia.org/T199711) (owner: 10Vgutierrez) [09:15:31] 10Operations, 10Design-Research: Edit optoutresearch@ mailing list recipients - https://phabricator.wikimedia.org/T100860 (10ArielGlenn) Ping: is the optoutresearch@ alias actually being used? Let's get a decision on this so we can move forward one way or the other. [09:19:01] 10Operations, 10Mail: update exim::listserve::private::mailing_lists value in puppet - https://phabricator.wikimedia.org/T82350 (10ArielGlenn) After (checks the calendar) 4 years, is this still a to-do or do we decline/resolve it? [09:19:39] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Onboarding Effie Mouzeli - https://phabricator.wikimedia.org/T201816 (10ArielGlenn) [09:20:35] (03PS8) 10Vgutierrez: [WIP] Validate challenges before pushing them to the ACME directory [software/certcentral] - 10https://gerrit.wikimedia.org/r/455159 (https://phabricator.wikimedia.org/T199711) [09:22:33] (03PS1) 10Filippo Giunchedi: Revert "Switch statsd/carbon to graphite1004" [dns] - 10https://gerrit.wikimedia.org/r/456098 [09:23:11] (03CR) 10Filippo Giunchedi: [C: 032] Revert "Switch statsd/carbon to graphite1004" [dns] - 10https://gerrit.wikimedia.org/r/456098 (owner: 10Filippo Giunchedi) [09:24:03] RECOVERY - Filesystem available is greater than filesystem size on ms-be2043 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2043&var-datasource=codfw%2520prometheus%252Fops [09:25:47] (03PS1) 10Banyek: admin: Create authorization check for https://tendril.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/456099 (https://phabricator.wikimedia.org/T149340) [09:26:24] (03CR) 10jerkins-bot: [V: 04-1] admin: Create authorization check for https://tendril.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/456099 (https://phabricator.wikimedia.org/T149340) (owner: 10Banyek) [09:28:08] (03PS2) 10Banyek: admin: Create authorization check for https://tendril.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/456099 (https://phabricator.wikimedia.org/T149340) [09:30:13] 10Operations, 10Africa-Wikimedia-Developers, 10Wikimedia-Mailing-lists: Rename project mailing list for Africa Wikimedia Developers project - https://phabricator.wikimedia.org/T183832 (10ArielGlenn) Pinging @D3r1ck01 for an update on this. [09:31:42] RECOVERY - carbon-frontend-relay metric drops on graphite1001 is OK: OK: Less than 80.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen [09:32:08] (03CR) 10Arturo Borrero Gonzalez: [C: 031] Revert "dumps: give access to perf-team" [puppet] - 10https://gerrit.wikimedia.org/r/455902 (owner: 10Bstorm) [09:32:13] 10Operations, 10docker-pkg, 10Patch-For-Review: releng/mediawiki-phpcs-dryrun fails to upload to docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T200722 (10hashar) https://gerrit.wikimedia.org/r/#/c/integration/config/+/455835/ drops the `releng/mediawiki-phpcs-dryrun` container by refactor... [09:32:33] RECOVERY - Filesystem available is greater than filesystem size on ms-be2040 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2040&var-datasource=codfw%2520prometheus%252Fops [09:33:46] !log cr1/2-eqiad: update analytics-in4 filter with the new archiva host, add a new term 'archiva' to analytics-in6 filter - T198623 [09:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:52] T198623: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 [09:34:19] 10Operations, 10Wikimedia-Planet: en.planet hasn't updated since July 25 - https://phabricator.wikimedia.org/T203055 (10Peachey88) [09:34:29] 10Operations, 10Africa-Wikimedia-Developers, 10Wikimedia-Mailing-lists: Rename project mailing list for Africa Wikimedia Developers project - https://phabricator.wikimedia.org/T183832 (10D3r1ck01) Hey @ArielGlenn, we've had 1 meeting about this and ehhh the renaming is not much of a big deal. So for now, we... [09:39:46] 10Operations, 10Africa-Wikimedia-Developers, 10Wikimedia-Mailing-lists: Rename project mailing list for Africa Wikimedia Developers project - https://phabricator.wikimedia.org/T183832 (10ArielGlenn) 05stalled>03declined Declining as per @D3r1ck01 's comment above. [09:40:56] 10Operations, 10Wikimedia-Mailing-lists, 10Chinese-Sites: Create mailing list for Bureaucrat of zh.wikipedia - https://phabricator.wikimedia.org/T202435 (10ArielGlenn) p:05Triage>03Normal [09:42:47] (03PS1) 10Aleksey Bekh-Ivanov (WMDE): Wikidata: Use new item ID formatter for Q1-Q10000 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456103 (https://phabricator.wikimedia.org/T201834) [09:42:51] 10Operations, 10Wikimedia-Mailing-lists, 10Chinese-Sites: Create mailing list for Bureaucrat of zh.wikipedia - https://phabricator.wikimedia.org/T202435 (10ArielGlenn) @herron Do the list requesters need to provide anything else, or what are the next steps for this? [09:43:04] (03CR) 10Aleksey Bekh-Ivanov (WMDE): "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/456103 (https://phabricator.wikimedia.org/T201834) (owner: 10Aleksey Bekh-Ivanov (WMDE)) [09:44:28] (03CR) 10Marostegui: admin: Create authorization check for https://tendril.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/456099 (https://phabricator.wikimedia.org/T149340) (owner: 10Banyek) [09:44:46] (03CR) 10Addshore: [C: 031] Wikidata: Use new item ID formatter for Q1-Q10000 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456103 (https://phabricator.wikimedia.org/T201834) (owner: 10Aleksey Bekh-Ivanov (WMDE)) [09:44:57] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456097 (owner: 10Marostegui) [09:45:51] (03CR) 10Addshore: Wikidata: Use new item ID formatter for Q1-Q10000 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456103 (https://phabricator.wikimedia.org/T201834) (owner: 10Aleksey Bekh-Ivanov (WMDE)) [09:46:17] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456097 (owner: 10Marostegui) [09:46:32] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 404 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test [09:46:32] from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 404 (expecting: 200) [09:47:28] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1084 (duration: 00m 57s) [09:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:00] (03PS1) 10Marostegui: db-eqiad.php: Depool db1091 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/456105 [09:48:14] !log Deploy schema change on db1102:s4 [09:48:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:31] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456105 (owner: 10Marostegui) [09:50:44] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456105 (owner: 10Marostegui) [09:51:47] (03PS9) 10Vgutierrez: Validate challenges before pushing them to the ACME directory [software/certcentral] - 10https://gerrit.wikimedia.org/r/455159 (https://phabricator.wikimedia.org/T199711) [09:51:55] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1091 (duration: 00m 58s) [09:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:20] !log Deploy schema change on db1091 [09:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:18] (03PS3) 10Faidon Liambotis: ssh-agent-proxy: support RSA SHA2 operations [puppet] - 10https://gerrit.wikimedia.org/r/455812 (https://phabricator.wikimedia.org/T202952) [09:54:20] (03PS4) 10Faidon Liambotis: ssh-agent-proxy: make key location configurable [puppet] - 10https://gerrit.wikimedia.org/r/455818 [09:54:22] (03PS4) 10Faidon Liambotis: ssh-agent-proxy: add default values to --help [puppet] - 10https://gerrit.wikimedia.org/r/455819 [09:54:24] (03PS2) 10Faidon Liambotis: ssh-agent-proxy: clear up client handling logic [puppet] - 10https://gerrit.wikimedia.org/r/455858 [09:54:26] (03PS4) 10Faidon Liambotis: ssh-agent-proxy: switch to logger and add --debug [puppet] - 10https://gerrit.wikimedia.org/r/455820 [09:54:28] (03PS1) 10Faidon Liambotis: ssh-agent-proxy: move the main code into functions [puppet] - 10https://gerrit.wikimedia.org/r/456106 [09:54:35] volans: ^ :) [09:54:41] paravoid: looking [09:54:46] the last one is new, the others 
are just updated [09:56:18] 10Operations, 10TCB-Team, 10WMDE-QWERTY-Team, 10wikidiff2, 10WMDE-QWERTY-Sprint-2018-08-29: Release wikidiff2 v1.7.3 - https://phabricator.wikimedia.org/T202301 (10Lea_WMDE) [09:56:45] volans: how do you feel about "err = type('SshAgentProtocolError', (IOError,), {})" ? [09:57:10] I kinda hate it, this line feels very non-obvious and perl-y to me [09:57:29] 10Operations, 10Mail, 10Toolforge, 10Security: Forward security@tools.wmflabs.org to security@wikimedia.org - https://phabricator.wikimedia.org/T182812 (10ArielGlenn) Whatever happened to this? Still alive as an issue? [09:57:38] can easily replace by "class SshAgentProtocolError(IOError): pass" and s/err/SshAgentProtocolError/ [09:58:43] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: rack/setup/install graphite1004 - https://phabricator.wikimedia.org/T196484 (10fgiunchedi) >>! In T196484#4541406, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://tools.wmflabs.org/sal/log/AWWE... 
[09:59:06] * volans still has to get to that one [10:00:38] that isn't part of my changes, it was just there [10:00:54] ah, checking [10:01:23] ewwww [10:01:30] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - https://gerrit.wikimedia.org/r/456097 (owner: Marostegui) [10:01:33] (CR) jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/456105 (owner: Marostegui) [10:01:51] hahaha [10:01:52] PROBLEM - HTTPS on archiva1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [10:02:59] my fault --^ [10:03:02] paravoid: yeah would be much better with: [10:03:03] class SshAgentProtocolError(OSError): [10:03:03] """Custom exception class.""" [10:03:25] note OSError instead IOError as in py3 the latter is deprecated and an alias to the former [10:03:39] (no need to add pass if there is a docstring ;) ) [10:03:51] ah, alright [10:04:02] RECOVERY - HTTPS on archiva1001 is OK: SSL OK - Certificate archiva.wikimedia.org valid until 2018-11-27 05:10:57 +0000 (expires in 89 days) [10:04:28] starting from 3.3 to be precise [10:06:09] Operations, DBA: Investigate slow servermon updating queries on db1016 - https://phabricator.wikimedia.org/T165674 (Marostegui) Any objections to decline this ticket as per volans comment above? (T165674#4449561) [10:09:57] Operations, DBA: Investigate slow servermon updating queries on db1016 - https://phabricator.wikimedia.org/T165674 (jcrespo) Open>declined [10:12:15] Operations, Wikidata, monitoring, Patch-For-Review, User-Addshore: Add Addshore & possibly other WMDE devs/deployers to the wikidata icinga contact list - https://phabricator.wikimedia.org/T195289 (ArielGlenn) Have you had a chance to check that it's working for you folks?
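The exception-class exchange above can be illustrated with a short sketch. It contrasts the dynamic `type(...)` form quoted in the channel with the explicit class statement suggested as its replacement. The helper `parse_request_code` and its one-byte range check are hypothetical, added only to show the exception being raised and caught; they are not taken from ssh-agent-proxy.

```python
# Sketch only: contrasts the two ways of defining the exception that were
# discussed in the channel. Nothing here is copied from ssh-agent-proxy.

# Dynamic form quoted in the channel: the class is built at runtime by type().
err = type('SshAgentProtocolError', (IOError,), {})


# Explicit form suggested as the replacement. The docstring alone is a valid
# class body, so no "pass" statement is needed.
class SshAgentProtocolError(OSError):
    """Custom exception class."""


def parse_request_code(code):
    """Hypothetical helper: accept only codes that fit in one byte."""
    if not 0 <= code <= 255:
        raise SshAgentProtocolError('invalid request code: %d' % code)
    return code


# Since Python 3.3, IOError is the same object as OSError, so both
# definitions end up with the same base class.
same_base = IOError is OSError
```

Either form can be caught with `except OSError`. The class statement mainly helps readability, and it gives the exception a docstring and a stable name in tracebacks.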
[10:14:37] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/455812 (https://phabricator.wikimedia.org/T202952) (owner: 10Faidon Liambotis) [10:15:15] (03PS1) 10Elukey: archiva::proxy: add support for ipv6 to nginx listen directives [puppet] - 10https://gerrit.wikimedia.org/r/456108 (https://phabricator.wikimedia.org/T192639) [10:16:06] 10Operations: move human users out of UID range for system accounts - https://phabricator.wikimedia.org/T114446 (10ArielGlenn) Do we know how many current users this impacts in labs? If the number is small, new user + email to move their stuff may not be awful. [10:16:43] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/455819 (owner: 10Faidon Liambotis) [10:18:16] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/12271/" [puppet] - 10https://gerrit.wikimedia.org/r/456108 (https://phabricator.wikimedia.org/T192639) (owner: 10Elukey) [10:18:23] (03CR) 10Volans: [C: 031] "LGTM, optional nitpick inline (feel free to merge without review if changed)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/455820 (owner: 10Faidon Liambotis) [10:19:54] (03PS1) 10Elukey: archiva::proxy: missed a ';' in nginx's config [puppet] - 10https://gerrit.wikimedia.org/r/456109 (https://phabricator.wikimedia.org/T192639) [10:20:36] (03CR) 10Elukey: [C: 032] archiva::proxy: missed a ';' in nginx's config [puppet] - 10https://gerrit.wikimedia.org/r/456109 (https://phabricator.wikimedia.org/T192639) (owner: 10Elukey) [10:20:39] (03PS1) 10Vgutierrez: ACMERequests: Remove orders/challenges after a non-recoverable error [software/certcentral] - 10https://gerrit.wikimedia.org/r/456110 (https://phabricator.wikimedia.org/T199711) [10:20:45] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/456106 (owner: 10Faidon Liambotis) [10:21:09] 10Operations, 10Release-Engineering-Team: Keyholder phab repo duplicate work - https://phabricator.wikimedia.org/T203003 
(10ArielGlenn) p:05Triage>03Normal [10:21:27] 10Operations, 10Discovery-Search (Current work): Migrate elasticsearch scripts to spicerack cookbooks - https://phabricator.wikimedia.org/T202885 (10ArielGlenn) p:05Triage>03Normal [10:22:01] (03CR) 10jerkins-bot: [V: 04-1] ACMERequests: Remove orders/challenges after a non-recoverable error [software/certcentral] - 10https://gerrit.wikimedia.org/r/456110 (https://phabricator.wikimedia.org/T199711) (owner: 10Vgutierrez) [10:22:18] !log joal@deploy1001 Started deploy [analytics/refinery@1c6423f]: Fix over yesterday weekly deploy of analytics Hadoop jobs - try 2 [10:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:31] 10Operations, 10Wikidata, 10monitoring, 10Patch-For-Review, 10User-Addshore: Add Addshore & possibly other WMDE devs/deployers to the wikidata icinga contact list - https://phabricator.wikimedia.org/T195289 (10Addshore) It looks like I still can't schedule downtime which is what I was expecting to happen... [10:23:22] (03PS2) 10Vgutierrez: ACMERequests: Remove orders/challenges after a non-recoverable error [software/certcentral] - 10https://gerrit.wikimedia.org/r/456110 (https://phabricator.wikimedia.org/T199711) [10:23:45] 10Operations, 10monitoring: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 (10ArielGlenn) p:05Triage>03Normal Is there someone who would want to oversee this getting done (not doing all the steps, just making sure the task moves along)? 
[10:23:54] (03PS3) 10Banyek: admin: Create authorization check for https://tendril.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/456099 (https://phabricator.wikimedia.org/T149340) [10:24:06] (03PS2) 10Alex Monk: certcentral_api: basic functionality fixes and error log [software/certcentral] - 10https://gerrit.wikimedia.org/r/456067 [10:24:36] (03CR) 10jerkins-bot: [V: 04-1] admin: Create authorization check for https://tendril.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/456099 (https://phabricator.wikimedia.org/T149340) (owner: 10Banyek) [10:25:39] 10Operations, 10ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T202705 (10ArielGlenn) p:05Triage>03High [10:26:04] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Kubernetes: Evaluate VMWare's Harbour as a docker registry - https://phabricator.wikimedia.org/T202504 (10ArielGlenn) p:05Triage>03Normal [10:27:01] 10Operations, 10ops-eqiad, 10DC-Ops: Degraded RAID on labvirt1019 - https://phabricator.wikimedia.org/T201957 (10ArielGlenn) p:05Triage>03Normal [10:27:18] (03PS1) 10Volans: sre.switchdc.mediawiki: add common parse_args [cookbooks] - 10https://gerrit.wikimedia.org/r/456111 (https://phabricator.wikimedia.org/T199079) [10:27:20] (03PS1) 10Volans: sre.switchdc.mediawiki: add Phase 0 cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/456112 (https://phabricator.wikimedia.org/T199079) [10:27:37] 10Operations, 10hardware-requests: Request for swift ms-be expansion - https://phabricator.wikimedia.org/T201937 (10ArielGlenn) p:05Triage>03High [10:27:51] 10Operations, 10hardware-requests: Request for swift ms-be refresh - https://phabricator.wikimedia.org/T201938 (10ArielGlenn) p:05Triage>03High [10:30:46] !log joal@deploy1001 Finished deploy [analytics/refinery@1c6423f]: Fix over yesterday weekly deploy of analytics Hadoop jobs - try 2 (duration: 08m 28s) [10:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[10:30:58] elukey: --^ [10:31:04] Operations, monitoring: Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (ArielGlenn) p:Triage>Normal [10:31:16] gooood [10:34:08] (PS17) Alex Monk: [WIP] Central certificates service [puppet] - https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [10:35:37] (CR) jerkins-bot: [V: -1] [WIP] Central certificates service [puppet] - https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: Alex Monk) [10:38:16] !log elukey@deploy1001 Started deploy [analytics/refinery@1c6423f]: (no justification provided) [10:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:19] (PS4) Banyek: admin: Create authorization check for https://tendril.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/456099 (https://phabricator.wikimedia.org/T149340) [10:41:12] !log elukey@deploy1001 Finished deploy [analytics/refinery@1c6423f]: (no justification provided) (duration: 02m 56s) [10:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:39] (CR) Alex Monk: "I would suggest a commit message beginning "tendril: Add monitoring for authorization check", this isn't in the admin module."
[puppet] - 10https://gerrit.wikimedia.org/r/456099 (https://phabricator.wikimedia.org/T149340) (owner: 10Banyek) [10:45:52] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 404 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test [10:45:52] from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 404 (expecting: 200) [10:46:40] (03PS5) 10Banyek: tendril: Add monitoring for authorization check [puppet] - 10https://gerrit.wikimedia.org/r/456099 (https://phabricator.wikimedia.org/T149340) [10:47:23] PROBLEM - Check systemd state on proton1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[10:50:08] (03CR) 10Alex Monk: [C: 032] ACMERequests: Remove orders/challenges after a non-recoverable error [software/certcentral] - 10https://gerrit.wikimedia.org/r/456110 (https://phabricator.wikimedia.org/T199711) (owner: 10Vgutierrez) [10:52:41] (03CR) 10Hashar: "Thank you :]" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/398136 (owner: 10Hashar) [10:53:33] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 404 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test [10:53:33] from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 404 (expecting: 200) [10:54:02] RECOVERY - Check systemd state on proton1001 is OK: OK - running: The system is fully operational [10:58:15] (03PS3) 10Hashar: Generate documentation with Sphinx [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/398462 [10:59:13] (03CR) 10Hashar: "Rebased, slightly tweaked a few things:" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/398462 (owner: 10Hashar) [11:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I, the Bot under the Fountain, allow thee, The Deployer, to do European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180829T1100). [11:00:05] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:27] here [11:00:46] o/ [11:00:51] I can do the swat [11:01:11] setting $wgCategoryCollation , doesnt that need some changes on the database schema? [11:01:13] hashar: oh, go ahead then, I'll continue with T188742 :D [11:01:13] T188742: Run tests daily targeting beta cluster for all repositories with Selenium tests - https://phabricator.wikimedia.org/T188742 [11:01:24] it needs only updateCollation.php [11:01:30] (IIRC) [11:01:34] (03PS2) 10Hashar: Set $wgCategoryCollation = uca-az on azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455887 (https://phabricator.wikimedia.org/T201770) (owner: 10Superyetkin) [11:01:40] (03PS1) 10Banyek: admin: add banyek to the dba contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/456114 (https://phabricator.wikimedia.org/T202521) [11:01:44] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455887 (https://phabricator.wikimedia.org/T201770) (owner: 10Superyetkin) [11:02:13] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 404 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test [11:02:13] from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 404 (expecting: 200) [11:02:33] PROBLEM - Check systemd state on proton1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:03:11] (03Merged) 10jenkins-bot: Set $wgCategoryCollation = uca-az on azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455887 (https://phabricator.wikimedia.org/T201770) (owner: 10Superyetkin) [11:03:38] 10Operations, 10media-storage, 10User-fgiunchedi: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 (10fgiunchedi) I've also documented the repair on https://wikitech.wikimedia.org/wiki/Graphite#Repair_xfs_misreporting_free_space as it will come up again for sure. [11:04:16] Ping me when needed, I'll continue with reading a thesis about Wikipedia in the meantime :D [11:04:43] (03PS2) 10Hashar: Allow sysops to remove flood flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455063 (https://phabricator.wikimedia.org/T202599) (owner: 10Urbanecm) [11:04:45] (03PS3) 10Hashar: Translation of scnwiktionary sitename was removed, add it back [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455627 (https://phabricator.wikimedia.org/T202926) (owner: 10Urbanecm) [11:04:47] (03PS4) 10Hashar: Allow subpages in main namespace in zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455380 (https://phabricator.wikimedia.org/T202007) (owner: 10星耀晨曦) [11:04:49] (03PS2) 10Hashar: Upload new logos for advisorswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455622 (https://phabricator.wikimedia.org/T202844) (owner: 10Urbanecm) [11:04:51] (03PS2) 10Hashar: Use new logos for advisorywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455623 (https://phabricator.wikimedia.org/T202844) (owner: 10Urbanecm) [11:05:34] (03CR) 10Faidon Liambotis: [C: 032] ssh-agent-proxy: support RSA SHA2 operations [puppet] - 10https://gerrit.wikimedia.org/r/455812 (https://phabricator.wikimedia.org/T202952) (owner: 10Faidon Liambotis) [11:05:42] (03CR) 10Marostegui: [C: 031] admin: add banyek to the dba contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/456114 
(https://phabricator.wikimedia.org/T202521) (owner: 10Banyek) [11:05:54] (03CR) 10jerkins-bot: [V: 04-1] Use new logos for advisorywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455623 (https://phabricator.wikimedia.org/T202844) (owner: 10Urbanecm) [11:06:04] !log hashar@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Set $wgCategoryCollation = uca-az on azwiki - T201770 (duration: 00m 56s) [11:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:09] T201770: Azerbaijani Wikipedia: Alphabetical order in the categories (collation) - https://phabricator.wikimedia.org/T201770 [11:06:25] (03PS4) 10Faidon Liambotis: keyholder: add a \n to a content => line [puppet] - 10https://gerrit.wikimedia.org/r/455821 [11:06:27] (03PS4) 10Faidon Liambotis: ssh-agent-proxy: support RSA SHA2 operations [puppet] - 10https://gerrit.wikimedia.org/r/455812 (https://phabricator.wikimedia.org/T202952) [11:06:27] !log mwscript updateCollation.php --wiki=azwiki --previous-collation=uppercase # T201770 [11:06:29] (03PS5) 10Faidon Liambotis: ssh-agent-proxy: make key location configurable [puppet] - 10https://gerrit.wikimedia.org/r/455818 [11:06:31] (03PS5) 10Faidon Liambotis: ssh-agent-proxy: add default values to --help [puppet] - 10https://gerrit.wikimedia.org/r/455819 [11:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:33] (03PS3) 10Faidon Liambotis: ssh-agent-proxy: clear up client handling logic [puppet] - 10https://gerrit.wikimedia.org/r/455858 [11:06:35] (03PS5) 10Faidon Liambotis: ssh-agent-proxy: switch to logger and add --debug [puppet] - 10https://gerrit.wikimedia.org/r/455820 [11:06:35] Urbanecm: collation script is running [11:06:37] (03PS2) 10Faidon Liambotis: ssh-agent-proxy: move the main code into functions [puppet] - 10https://gerrit.wikimedia.org/r/456106 [11:06:39] (03PS1) 10Faidon Liambotis: ssh-agent-proxy: use a custom exception class [puppet] - 
10https://gerrit.wikimedia.org/r/456115 [11:06:39] ack [11:06:41] (03CR) 10Marostegui: [C: 031] "Commit message should use: contactgroups.cfg instead of admin I would say" [puppet] - 10https://gerrit.wikimedia.org/r/456114 (https://phabricator.wikimedia.org/T202521) (owner: 10Banyek) [11:07:17] (03CR) 10Faidon Liambotis: [C: 032] keyholder: add a \n to a content => line [puppet] - 10https://gerrit.wikimedia.org/r/455821 (owner: 10Faidon Liambotis) [11:07:32] (03CR) 10jenkins-bot: Set $wgCategoryCollation = uca-az on azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455887 (https://phabricator.wikimedia.org/T201770) (owner: 10Superyetkin) [11:07:42] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test [11:07:42] from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 500 (expecting: 200) [11:07:59] (03CR) 10Faidon Liambotis: [C: 032] ssh-agent-proxy: make key location configurable [puppet] - 10https://gerrit.wikimedia.org/r/455818 (owner: 10Faidon Liambotis) [11:08:07] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455063 (https://phabricator.wikimedia.org/T202599) (owner: 10Urbanecm) [11:08:24] (03CR) 10Faidon Liambotis: [C: 032] ssh-agent-proxy: add default values to --help [puppet] - 10https://gerrit.wikimedia.org/r/455819 (owner: 10Faidon Liambotis) [11:08:29] (03PS2) 10Banyek: contactgroups.cfg: add banyek to the dba contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/456114 (https://phabricator.wikimedia.org/T202521) [11:08:58] paravoid: as those changes are only for the proxy, they 
should not need a re-arm of keyholder, but just keep an eye for the icinga check JIC ;) [11:10:19] (03CR) 10Faidon Liambotis: [C: 032] ssh-agent-proxy: clear up client handling logic [puppet] - 10https://gerrit.wikimedia.org/r/455858 (owner: 10Faidon Liambotis) [11:10:53] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 404 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test [11:10:53] from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 404 (expecting: 200) [11:11:14] volans: [11:11:15] - logger.debug('Unknown request code, refusing') [11:11:15] + logger.debug('Unknown request code %d, refusing', code) [11:11:18] that OK with you? [11:11:26] +! [11:11:27] +1 [11:11:40] (03PS6) 10Faidon Liambotis: ssh-agent-proxy: switch to logger and add --debug [puppet] - 10https://gerrit.wikimedia.org/r/455820 [11:11:42] (03PS3) 10Faidon Liambotis: ssh-agent-proxy: move the main code into functions [puppet] - 10https://gerrit.wikimedia.org/r/456106 [11:11:44] (03PS2) 10Faidon Liambotis: ssh-agent-proxy: use a custom exception class [puppet] - 10https://gerrit.wikimedia.org/r/456115 [11:11:44] thx [11:11:50] (03Merged) 10jenkins-bot: Allow sysops to remove flood flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455063 (https://phabricator.wikimedia.org/T202599) (owner: 10Urbanecm) [11:12:14] (03PS3) 10Banyek: contactgroups.cfg: add banyek to the dba contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/456114 (https://phabricator.wikimedia.org/T202521) [11:12:31] Urbanecm: sorry I got side tracked by something else. 
Proceeding with the rest of the patches [11:12:42] (03CR) 10Faidon Liambotis: [C: 032] ssh-agent-proxy: switch to logger and add --debug [puppet] - 10https://gerrit.wikimedia.org/r/455820 (owner: 10Faidon Liambotis) [11:12:56] (03CR) 10Hashar: [C: 032] Translation of scnwiktionary sitename was removed, add it back [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455627 (https://phabricator.wikimedia.org/T202926) (owner: 10Urbanecm) [11:12:57] I don't know what "side tracked" means, but it's good SWAT's continuing :) [11:13:15] !log hashar@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Allow sysops to remove flood flag - T202599 (duration: 00m 56s) [11:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:21] T202599: New user groups in zhwikiversity - https://phabricator.wikimedia.org/T202599 [11:13:23] (03CR) 10Jcrespo: [C: 031] contactgroups.cfg: add banyek to the dba contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/456114 (https://phabricator.wikimedia.org/T202521) (owner: 10Banyek) [11:13:42] (03CR) 10Banyek: [C: 032] contactgroups.cfg: add banyek to the dba contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/456114 (https://phabricator.wikimedia.org/T202521) (owner: 10Banyek) [11:14:15] (03Merged) 10jenkins-bot: Translation of scnwiktionary sitename was removed, add it back [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455627 (https://phabricator.wikimedia.org/T202926) (owner: 10Urbanecm) [11:15:22] (03PS3) 10Faidon Liambotis: ssh-agent-proxy: use a custom exception class [puppet] - 10https://gerrit.wikimedia.org/r/456115 [11:15:24] (03PS4) 10Faidon Liambotis: ssh-agent-proxy: move the main code into functions [puppet] - 10https://gerrit.wikimedia.org/r/456106 [11:15:55] volans: can you review those two? 
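The `logger.debug` change agreed at 11:11:15 passes the request code as a logging argument instead of leaving it out of the message. A minimal sketch of that pattern follows, assuming a standalone logger; the logger name, `refuse_unknown` and `demo` are hypothetical and are not taken from the ssh-agent-proxy source.

```python
import io
import logging

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("ssh-agent-proxy-sketch")


def refuse_unknown(code):
    """Log an unknown request code and refuse it."""
    # The code is passed as an argument, so the %d interpolation only
    # happens if a handler actually emits the record.
    logger.debug('Unknown request code %d, refusing', code)
    return False


def demo():
    """Capture what the debug line looks like for request code 42."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep the sketch's output out of the root logger
    try:
        refuse_unknown(42)
    finally:
        logger.removeHandler(handler)
    return stream.getvalue()
```

With the level set to DEBUG the captured line includes the offending code, which is the point of the one-line diff.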
[11:16:11] sure [11:16:18] and feel free to +2 :) [11:16:48] (03CR) 10Volans: [C: 032] ssh-agent-proxy: use a custom exception class [puppet] - 10https://gerrit.wikimedia.org/r/456115 (owner: 10Faidon Liambotis) [11:17:14] (03PS4) 10Banyek: contactgroups.cfg: add banyek to the dba contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/456114 (https://phabricator.wikimedia.org/T202521) [11:17:16] (03CR) 10Volans: [C: 032] ssh-agent-proxy: move the main code into functions [puppet] - 10https://gerrit.wikimedia.org/r/456106 (owner: 10Faidon Liambotis) [11:17:27] paravoid: both +2 and submit [11:17:31] are you puppet-merging them? [11:17:46] yeah, just now [11:18:01] thank you so much for the reviews! [11:18:19] thanks for making it prettier :) [11:18:36] still needs some work I think [11:19:19] (03CR) 10Banyek: [V: 032 C: 032] contactgroups.cfg: add banyek to the dba contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/456114 (https://phabricator.wikimedia.org/T202521) (owner: 10Banyek) [11:19:24] quite some [11:19:42] (03PS5) 10Banyek: contactgroups.cfg: add banyek to the dba contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/456114 (https://phabricator.wikimedia.org/T202521) [11:20:02] !log hashar@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Translation of scnwiktionary sitename was removed, add it back - T202926 (duration: 00m 56s) [11:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:06] T202926: Translation of scnwiktionary sitename was removed - https://phabricator.wikimedia.org/T202926 [11:20:53] Urbanecm: looking at zhwikiversity namespace thing. 
I am gonna run namespaceDupes.php and fix the existing issues [11:20:59] (03CR) 10Banyek: [V: 032 C: 032] contactgroups.cfg: add banyek to the dba contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/456114 (https://phabricator.wikimedia.org/T202521) (owner: 10Banyek) [11:21:36] (03Abandoned) 10Gehel: Elasticsearch now uses the more generic nginx::simple_tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/304010 (owner: 10Gehel) [11:21:41] bah that does not fix anything ... [11:21:50] hashar, because zeljkof ran it yesterday [11:21:56] ok [11:22:08] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455380 (https://phabricator.wikimedia.org/T202007) (owner: 10星耀晨曦) [11:22:26] So if you're seeing some issues, they're probably caused by something on wiki [11:22:49] ok ok :) [11:23:20] Double ok, nice :D [11:23:27] (03Merged) 10jenkins-bot: Allow subpages in main namespace in zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455380 (https://phabricator.wikimedia.org/T202007) (owner: 10星耀晨曦) [11:23:43] PROBLEM - Keyholder SSH agent on netmon1002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [11:24:01] ugh [11:24:28] armed [11:24:29] how can that be, are you working on it? [11:24:31] yes [11:24:33] I suppose so [11:24:34] ok [11:24:42] caused by a \n :) [11:24:47] eheheh [11:24:52] RECOVERY - Keyholder SSH agent on netmon1002 is OK: OK: Keyholder is armed with all configured keys. [11:25:11] is there any sort of wmf mailing list to get emails about all phab UBN tickets? 
[11:25:15] !log hashar@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Allow subpages in main namespace in zhwikiversity - T202007 (duration: 00m 56s) [11:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:20] T202007: Allow subpages in main namespace in zhwikiversity - https://phabricator.wikimedia.org/T202007 [11:26:10] (03CR) 10jenkins-bot: Allow sysops to remove flood flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455063 (https://phabricator.wikimedia.org/T202599) (owner: 10Urbanecm) [11:26:12] (03CR) 10jenkins-bot: Translation of scnwiktionary sitename was removed, add it back [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455627 (https://phabricator.wikimedia.org/T202926) (owner: 10Urbanecm) [11:26:14] (03CR) 10jenkins-bot: Allow subpages in main namespace in zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455380 (https://phabricator.wikimedia.org/T202007) (owner: 10星耀晨曦) [11:27:05] Urbanecm: and now doing the advsiorswiki logo changes [11:27:09] ack [11:27:19] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455622 (https://phabricator.wikimedia.org/T202844) (owner: 10Urbanecm) [11:27:59] (03CR) 10Hashar: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455623 (https://phabricator.wikimedia.org/T202844) (owner: 10Urbanecm) [11:28:13] PROBLEM - Keyholder SSH agent on deploy1001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. 
[11:28:33] (03Merged) 10jenkins-bot: Upload new logos for advisorswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455622 (https://phabricator.wikimedia.org/T202844) (owner: 10Urbanecm) [11:28:35] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455623 (https://phabricator.wikimedia.org/T202844) (owner: 10Urbanecm) [11:28:39] paravoid: I'm curious why they need a re-arm ^^^ [11:28:42] PROBLEM - Keyholder SSH agent on sarin is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [11:28:46] if you restart only the proxy it shouldn't [11:28:52] PROBLEM - Keyholder SSH agent on labpuppetmaster1002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [11:29:01] volans: the \n change in the config... [11:29:22] ahhhhh sorry I thought you restarted the one on netmon and pasted a \n in the password [11:29:35] paravoid: do you need a hand to re-arm them? [11:29:49] (03Merged) 10jenkins-bot: Use new logos for advisorywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455623 (https://phabricator.wikimedia.org/T202844) (owner: 10Urbanecm) [11:29:51] maybe? :) [11:30:00] !log rearmed keyholder on sarin [11:30:01] I don't even know the filename of the passwords :P [11:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:08] paravoid: https://wikitech.wikimedia.org/wiki/Keyholder [11:30:14] I'll do neodymium [11:30:18] !log hashar@deploy1001 Synchronized static/images/project-logos: Upload new logos for advisorswiki - T202844 (duration: 00m 37s) [11:30:22] could you rearm the keyholder on deploy1001 as well ? 
:D [11:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:28] T202844: Change logo for advisors.wikimedia.org - https://phabricator.wikimedia.org/T202844 [11:30:33] !log last sync did not synchronize due to ssh / keyholder issues [11:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:42] PROBLEM - Keyholder SSH agent on deploy2001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [11:30:50] hashar: sure [11:30:52] ah there's a table, how nice [11:30:53] RECOVERY - Keyholder SSH agent on sarin is OK: OK: Keyholder is armed with all configured keys. [11:31:59] labpuppetmaster1002 done [11:32:00] sorry guys [11:32:00] hashar: done [11:32:03] RECOVERY - Keyholder SSH agent on labpuppetmaster1002 is OK: OK: Keyholder is armed with all configured keys. [11:32:08] deploy[12]001 done [11:32:26] volans: thanks :) [11:32:42] RECOVERY - Keyholder SSH agent on deploy2001 is OK: OK: Keyholder is armed with all configured keys. [11:32:59] neodymium done [11:33:01] Urbanecm: syncing logo files again :] [11:33:04] ack [11:33:19] !log hashar@deploy1001 Synchronized static/images/project-logos: Upload new logos for advisorswiki - T202844 (duration: 00m 55s) [11:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:52] labpuppetmasters done [11:34:31] Urbanecm: and the collation update is still running [11:34:43] RECOVERY - Keyholder SSH agent on deploy1001 is OK: OK: Keyholder is armed with all configured keys. 
[11:34:50] I hope it won't run too long [11:35:06] !log hashar@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Use new logos for advisorywiki - T202844 (duration: 00m 57s) [11:35:08] 880k items so far [11:35:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:17] advisorywiki should have new logos now [11:36:02] That's nearly the end, azwiki has 924k rows [11:36:13] yes, it has, nice :) [11:36:24] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 (10MoritzMuehlenhoff) @Papaul : Does this maybe need some additional change in the BIOS to make the server PXE-boot from the internal NIC? When I'm trying to install it, I still see that i... [11:36:45] netmon2001 done too, I think that's all of them [11:36:52] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 404 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test [11:36:52] from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 404 (expecting: 200) [11:37:07] according to the hosts matching R:keyholder::agent [11:37:08] !log deploy1001: "mwscript updateCollation.php --wiki=azwiki --previous-collation=uppercase" completed | T201770 [11:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:13] T201770: Azerbaijani Wikipedia: Alphabetical order in the categories (collation) - https://phabricator.wikimedia.org/T201770 [11:37:25] so, 6 patches deployed, 30 minutes left. Do you have a few minutes to deploy the rest of my patches to prevent me from eating into another SWAT :D? [11:37:41] !log European SWAT completed. 
[11:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:46] ahh [11:37:48] yeah sure [11:38:00] ok, I'll add them to the calendar [11:38:52] hashar, {{done}} [11:39:13] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455240 (https://phabricator.wikimedia.org/T177506) (owner: 10Urbanecm) [11:39:18] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455241 (https://phabricator.wikimedia.org/T177506) (owner: 10Urbanecm) [11:39:23] PROBLEM - Keyholder SSH agent on labpuppetmaster1001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [11:39:33] ah wgCopyUploadsDomains ... [11:39:38] you hate that variable? [11:39:46] yeah [11:39:53] why, if I may ask? :D [11:39:59] the GWToolset was really just for a very specific use case [11:40:23] namely mass uploading massive collections of pictures from libraries / museums [11:40:44] Tempora mutantur, nos et mutamur in illis [11:40:50] (03Merged) 10jenkins-bot: Upload HD logos for various wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455240 (https://phabricator.wikimedia.org/T177506) (owner: 10Urbanecm) [11:40:52] and it seems to now be used to upload random collections of files from across the internet that most probably will never be used :] [11:40:56] but yeah I am ranting really ! 
[11:41:23] (03CR) 10jenkins-bot: Upload new logos for advisorswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455622 (https://phabricator.wikimedia.org/T202844) (owner: 10Urbanecm) [11:41:25] (03CR) 10jenkins-bot: Use new logos for advisorywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455623 (https://phabricator.wikimedia.org/T202844) (owner: 10Urbanecm) [11:41:27] (03CR) 10jenkins-bot: Upload HD logos for various wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455240 (https://phabricator.wikimedia.org/T177506) (owner: 10Urbanecm) [11:41:53] Urbanecm: I will do it of course :] [11:42:03] That's up to you ofc [11:42:42] !log hashar@deploy1001 Synchronized static/images/project-logos: Upload HD logos for various wikibooks - T177506 (duration: 00m 55s) [11:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:47] T177506: Provide HD logos for all Wikibooks - https://phabricator.wikimedia.org/T177506 [11:43:08] (03PS2) 10Hashar: Use HD logos for various Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455241 (https://phabricator.wikimedia.org/T177506) (owner: 10Urbanecm) [11:43:17] (03CR) 10Hashar: [C: 032] Use HD logos for various Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455241 (https://phabricator.wikimedia.org/T177506) (owner: 10Urbanecm) [11:44:44] (03Merged) 10jenkins-bot: Use HD logos for various Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455241 (https://phabricator.wikimedia.org/T177506) (owner: 10Urbanecm) [11:44:46] Urbanecm: I think https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/455234/ needs a rebase :] [11:45:10] will do it manually, wait a minute [11:45:40] apergos: i just had a crazy idea, we are discussing getting notifications about UBN tickets tagged as wikidata to us (the wikidata team), would having icinga check the number of UBN wikidata tickets there are and alarm if greater than 0 be a bit of an 
abuse of icinga or perhaps okay? [11:46:30] my knee-jerk reaction is that I don't love it [11:46:40] !log hashar@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Use HD logos for various Wikibooks - T177506 (duration: 00m 56s) [11:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:46] * addshore goes to look at grafana notifications and alerts [11:47:25] ACKNOWLEDGEMENT - Keyholder SSH agent on labpuppetmaster1001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. Arturo Borrero Gonzalez ACK [11:47:31] (03PS2) 10Urbanecm: yphc.ir to the wgCopyUploadsDomains whitelist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455234 (https://phabricator.wikimedia.org/T201237) [11:47:40] addshore: I strongly discourage it, as anyone can change the priority to UBN [11:47:58] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455234 (https://phabricator.wikimedia.org/T201237) (owner: 10Urbanecm) [11:47:59] is it the sms part you are really wanting? because sending an email to the list you are all on, that's easy enough [11:48:02] arturo: mmmh strange, I did re-arm it before [11:48:09] volans: :-S [11:48:15] apergos: just email really [11:48:25] maybe puppet didn't run yet [11:48:28] sorry, my bad [11:48:42] addshore, that's something Herald can do, email you regardless of your notification privileges [11:48:58] I would say that once you have a script that counts ubns (which you would need for the check anyways), having it send mail to your list, and running it out of cron every so often ought to be fine [11:49:05] are there any example rules that already do that? 
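The cron-driven "count UBNs and mail the list" idea from 11:48:58 could look roughly like the sketch below. Everything here is an assumption made for illustration: the shape of the task dicts, the function names and the example.org addresses are hypothetical, and fetching tasks from Phabricator and actually sending the mail are deliberately left out.

```python
from email.message import EmailMessage

# Priority label as displayed for "UBN" tasks; treated as an assumption here.
UBN_PRIORITY = "Unbreak Now!"


def find_ubn(tasks, project="wikidata"):
    """Return the tasks that are UBN priority and tagged with `project`.

    `tasks` is assumed to be a list of dicts with hypothetical keys
    "id", "title", "priority" and "projects".
    """
    return [
        t for t in tasks
        if t.get("priority") == UBN_PRIORITY and project in t.get("projects", ())
    ]


def build_mail(ubn_tasks, sender, recipient):
    """Build a notification mail, or return None when there is nothing to report."""
    if not ubn_tasks:
        return None
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"{len(ubn_tasks)} open UBN task(s)"
    msg.set_content("\n".join(f"{t['id']}: {t['title']}" for t in ubn_tasks))
    return msg


# Made-up sample data, not real tickets.
SAMPLE_TASKS = [
    {"id": "T1", "title": "example outage", "priority": UBN_PRIORITY,
     "projects": ["wikidata"]},
    {"id": "T2", "title": "example cleanup", "priority": "Normal",
     "projects": ["wikidata"]},
    {"id": "T3", "title": "unrelated outage", "priority": UBN_PRIORITY,
     "projects": ["other"]},
]
```

A cron job would call `find_ubn` on freshly fetched tasks and hand any non-`None` result of `build_mail` to the local mail system. Returning `None` for an empty list keeps quiet runs from generating mail, and since anyone can set a task to UBN (as noted at 11:47:40), a mail-only notification is a lower-stakes fit than a paging check.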
[11:49:16] addshore, wait a moment [11:49:22] (03Merged) 10jenkins-bot: yphc.ir to the wgCopyUploadsDomains whitelist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455234 (https://phabricator.wikimedia.org/T201237) (owner: 10Urbanecm) [11:49:26] oh herald [11:49:28] right [11:49:49] volans: [11:49:51] https://www.irccloud.com/pastebin/v1ug8qDA/ [11:50:18] addshore, https://phabricator.wikimedia.org/H59 has an email action [11:50:29] yeah probably I run arm before puppet had run there, so it restarted it after running [11:50:32] sorry about that [11:50:35] IIRC it ignores mail preferences, but I did not try. [11:50:53] RECOVERY - Keyholder SSH agent on labpuppetmaster1001 is OK: OK: Keyholder is armed with all configured keys. [11:51:01] !log hashar@deploy1001 Synchronized wmf-config/InitialiseSettings.php: yphc.ir to the wgCopyUploadsDomains whitelist - T201237 (duration: 00m 56s) [11:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:06] T201237: Please add yphc.ir to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T201237 [11:51:08] volans: np, I don't even know what is this keyholder all about :-P [11:51:37] ssh agent+proxy to allow users to use an ssh key without having access to the private key or its passphrase [11:51:49] arturo: https://wikitech.wikimedia.org/wiki/Keyholder [11:53:00] !log European SWAT completed (2) [11:53:01] ok thanks [11:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:03] Urbanecm: done done done :] [11:53:09] thanks [11:56:42] (03CR) 10jenkins-bot: Use HD logos for various Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455241 (https://phabricator.wikimedia.org/T177506) (owner: 10Urbanecm) [11:56:44] (03CR) 10jenkins-bot: yphc.ir to the wgCopyUploadsDomains whitelist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455234 
(https://phabricator.wikimedia.org/T201237) (owner: 10Urbanecm) [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180829T1200) [12:10:05] 10Operations, 10monitoring: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 (10Volans) @ArielGlenn yes, if I'm not mistaken @Dzahn has volunteered to do it. [12:15:22] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 404 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test [12:15:23] from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 404 (expecting: 200) [12:18:22] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 48.15, 29.69, 19.25 [12:23:12] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 404 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test [12:23:12] from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 404 (expecting: 200) [12:25:45] (03PS1) 10Jonas Kress (WMDE): Enabel WBQualityConstraintsSuggestionsBetaFeature on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456124 (https://phabricator.wikimedia.org/T202712) [12:26:29] (03PS2) 10Jonas Kress (WMDE): Enable WBQualityConstraintsSuggestionsBetaFeature on 
beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456124 (https://phabricator.wikimedia.org/T202712) [12:29:42] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 404 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test [12:29:42] from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 404 (expecting: 200) [12:32:46] (03PS1) 10Ladsgroup: ores in labs: issue 403 for two user agents [puppet] - 10https://gerrit.wikimedia.org/r/456126 (https://phabricator.wikimedia.org/T202655) [12:39:32] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: nova-network: allow extra dnsmasq/dhcp option [puppet] - 10https://gerrit.wikimedia.org/r/456127 (https://phabricator.wikimedia.org/T202636) [12:40:12] (03CR) 10jerkins-bot: [V: 04-1] cloudvps: nova-network: allow extra dnsmasq/dhcp option [puppet] - 10https://gerrit.wikimedia.org/r/456127 (https://phabricator.wikimedia.org/T202636) (owner: 10Arturo Borrero Gonzalez) [12:40:12] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 9.83, 15.55, 23.28 [12:40:23] RECOVERY - Check systemd state on proton1001 is OK: OK - running: The system is fully operational [12:40:42] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test [12:40:42] from en.wp.org in A4 format using optimized 
for reading on mobile devices returned the unexpected status 500 (expecting: 200) [12:42:58] (03CR) 10Elukey: [C: 032] Assign role::spare::system to meitnerium [puppet] - 10https://gerrit.wikimedia.org/r/456090 (https://phabricator.wikimedia.org/T192639) (owner: 10Elukey) [12:43:05] (03PS3) 10Elukey: Assign role::spare::system to meitnerium [puppet] - 10https://gerrit.wikimedia.org/r/456090 (https://phabricator.wikimedia.org/T192639) [12:44:57] 10Operations, 10Wikidata, 10monitoring, 10Patch-For-Review, 10User-Addshore: Add Addshore & possibly other WMDE devs/deployers to the wikidata icinga contact list - https://phabricator.wikimedia.org/T195289 (10ArielGlenn) 05Open>03Resolved I have it from a very good source (;-) that the problem turne... [12:47:02] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "Compiler is happy:" [puppet] - 10https://gerrit.wikimedia.org/r/456127 (https://phabricator.wikimedia.org/T202636) (owner: 10Arturo Borrero Gonzalez) [12:48:03] (03CR) 10Elukey: "looks good, added a little comment!" 
(031 comment) [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/456019 (https://phabricator.wikimedia.org/T202812) (owner: 10Ottomata) [12:48:11] (03PS2) 10Arturo Borrero Gonzalez: cloudvps: nova-network: allow extra dnsmasq/dhcp option [puppet] - 10https://gerrit.wikimedia.org/r/456127 (https://phabricator.wikimedia.org/T202636) [12:48:14] (03PS6) 10Banyek: tendril: Add monitoring for authorization check [puppet] - 10https://gerrit.wikimedia.org/r/456099 (https://phabricator.wikimedia.org/T149340) [12:52:33] (03CR) 10Elukey: "Just to understand: IIRC spark-env.sh is deployed via the spark2 deb, so this change will make sure that puppet forces a version with pyth" [puppet] - 10https://gerrit.wikimedia.org/r/456020 (owner: 10Ottomata) [12:54:45] (03PS7) 10Banyek: tendril: Add monitoring for authorization check [puppet] - 10https://gerrit.wikimedia.org/r/456099 (https://phabricator.wikimedia.org/T149340) [12:55:53] (03CR) 10Jcrespo: [C: 031] tendril: Add monitoring for authorization check [puppet] - 10https://gerrit.wikimedia.org/r/456099 (https://phabricator.wikimedia.org/T149340) (owner: 10Banyek) [12:56:12] PROBLEM - statsd UDP receive errors are elevated on graphite1004 is CRITICAL: 5.145 ge 2 https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&refresh=1m&panelId=16&fullscreen [12:58:22] RECOVERY - statsd UDP receive errors are elevated on graphite1004 is OK: (C)2 ge (W)1 ge 0 https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&refresh=1m&panelId=16&fullscreen [12:58:30] 10Operations, 10Wikimedia-Logstash, 10Goal, 10User-fgiunchedi, 10User-herron: Modernize logging, alerting and metrics monitoring infrastructure - Adopt Logstash (2018-19 Q1 Goal) - https://phabricator.wikimedia.org/T198753 (10fgiunchedi) [12:58:36] 10Operations, 10Wikimedia-Logstash, 10Goal, 10Patch-For-Review, and 2 others: Shorten logstash retention temporarily - https://phabricator.wikimedia.org/T201971 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi 
Resolving this since we're using less replicas for older indices now and no longer have file... [13:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180829T1300) [13:01:59] zeljkof: train is not running 'right now' right? Deployment window is just placeholder here? [13:02:45] kart_: yes, the wiki page only has a placeholder for EU train, but this week it's US train https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180829T1300 [13:02:55] tldr: no train right now :D [13:03:00] I'll go ahead for small cxserver update. Thanks zeljkof! [13:07:34] !log kartik@deploy1001 Started deploy [cxserver/deploy@c3385eb]: Update cxserver to afe0d1f (T202970) [13:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:40] T202970: Adapted templates pass too much data to CX frontend - https://phabricator.wikimedia.org/T202970 [13:08:38] (03PS1) 10Gehel: logstash: move elasticsearch data directory [puppet] - 10https://gerrit.wikimedia.org/r/456133 (https://phabricator.wikimedia.org/T198351) [13:10:06] (03PS1) 10Gehel: relforge: move elasticsearch data directory [puppet] - 10https://gerrit.wikimedia.org/r/456135 (https://phabricator.wikimedia.org/T198351) [13:11:21] !log kartik@deploy1001 Finished deploy [cxserver/deploy@c3385eb]: Update cxserver to afe0d1f (T202970) (duration: 03m 46s) [13:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:08] (03CR) 10Filippo Giunchedi: [C: 031] logstash: move elasticsearch data directory [puppet] - 10https://gerrit.wikimedia.org/r/456133 (https://phabricator.wikimedia.org/T198351) (owner: 10Gehel) [13:13:50] (03PS1) 10Gehel: elasticsearch: move elasticsearch data directory [puppet] - 10https://gerrit.wikimedia.org/r/456137 (https://phabricator.wikimedia.org/T198351) [13:13:52] (03PS1) 10Gehel: elasticsearch: move elasticsearch data directory [puppet] - 10https://gerrit.wikimedia.org/r/456138 
(https://phabricator.wikimedia.org/T198351) [13:19:35] (03PS2) 10Ottomata: Install binary pyarrow package to /usr/lib/spark2/python on install [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/456019 (https://phabricator.wikimedia.org/T202812) [13:19:48] (03CR) 10Ottomata: [V: 032 C: 032] Install binary pyarrow package to /usr/lib/spark2/python on install [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/456019 (https://phabricator.wikimedia.org/T202812) (owner: 10Ottomata) [13:24:47] 10Operations, 10monitoring: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 (10Dzahn) a:03Dzahn [13:25:21] (03CR) 10Ottomata: "Hmm correct. I was about to comment that I'd prefer to avoid making modifications to the git source files in the .deb, so I'm doing it in" [puppet] - 10https://gerrit.wikimedia.org/r/456020 (owner: 10Ottomata) [13:26:52] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/12276" [puppet] - 10https://gerrit.wikimedia.org/r/456096 (owner: 10Elukey) [13:27:23] (03PS1) 10Ottomata: Default python3 and ipython3 in spark-env.sh [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/456142 [13:27:25] (03Abandoned) 10Ottomata: spark2 - custom spark-env.sh that defaults to using python3 (and ipython3) [puppet] - 10https://gerrit.wikimedia.org/r/456020 (owner: 10Ottomata) [13:27:55] (03CR) 10Ottomata: [V: 032 C: 032] Default python3 and ipython3 in spark-env.sh [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/456142 (owner: 10Ottomata) [13:29:02] 10Operations, 10Wikimedia-Planet: en.planet hasn't updated since July 25 - https://phabricator.wikimedia.org/T203055 (10Dzahn) a:03Dzahn [13:35:46] (03CR) 10Gehel: [C: 032] relforge: move elasticsearch data directory [puppet] - 10https://gerrit.wikimedia.org/r/456135 (https://phabricator.wikimedia.org/T198351) (owner: 10Gehel) [13:36:55] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work): Onboarding Mathew Onipe - 
https://phabricator.wikimedia.org/T202708 (10chasemp) >>! In T202708#4535151, @Gehel wrote: > It is not entirely clear what access we want to give @Mathew.onipe at this point. > > Constraints: > * Matt is a con... [13:37:51] 10Operations, 10SRE-Access-Requests, 10wikidiff2, 10Patch-For-Review, 10User-Addshore: Give WMDE-Fisch permission to upload wikidiff2 releases (releasers-wikidiff2) - https://phabricator.wikimedia.org/T202475 (10Dzahn) releases.wikimedia.org has 2 backends, releases1001 and releases2001, so one in eqiad... [13:39:55] !log shutting down wdqs2001 for new SSD and reimaging - T202777 [13:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:01] T202777: add SSDs to wdqs200[12] - https://phabricator.wikimedia.org/T202777 [13:41:40] !log gehel@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=wdqs,name=wdqs2001.codfw.wmnet [13:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:42] !log restart relforge for plugin upgrade, data dir migration and kernel upgrade [13:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:57] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work): Onboarding Mathew Onipe - https://phabricator.wikimedia.org/T202708 (10Gehel) Summarizing a few back channel conversations here: * the current thinking is to start by giving @Mathew.onipe a few already existing roles (elasticsearch-root... 
[13:53:57] 10Operations, 10Analytics: Decommission Ganeti vm meitnerium.wikimedia.org (old Archiva host) - https://phabricator.wikimedia.org/T203087 (10elukey) p:05Triage>03Normal [14:01:06] (03PS3) 10Bstorm: Revert "dumps: give access to perf-team" [puppet] - 10https://gerrit.wikimedia.org/r/455902 [14:03:20] (03CR) 10Bstorm: [C: 032] Revert "dumps: give access to perf-team" [puppet] - 10https://gerrit.wikimedia.org/r/455902 (owner: 10Bstorm) [14:03:29] (03PS1) 10Gilles: Upgrade to 2.01 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/456145 (https://phabricator.wikimedia.org/T198370) [14:04:17] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb={LIST,PATCH} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:04:46] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation=list https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:07:12] (03PS2) 10Ottomata: Remove now unused turnilo, superset, hue, yarn puppetization from thorium [puppet] - 10https://gerrit.wikimedia.org/r/455864 (https://phabricator.wikimedia.org/T202011) [14:07:34] (03CR) 10Ottomata: [V: 032 C: 032] Remove now unused turnilo, superset, hue, yarn puppetization from thorium [puppet] - 10https://gerrit.wikimedia.org/r/455864 (https://phabricator.wikimedia.org/T202011) (owner: 10Ottomata) [14:08:46] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:09:07] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:12:07] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [14:12:39] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work): Onboarding Mathew Onipe - https://phabricator.wikimedia.org/T202708 (10Dzahn) >>! In T202708#4542296, @Gehel wrote: > (reimaging servers, access to remote management consoles, ...), so at some point we'll need to provide larger accesses... [14:13:25] (03PS1) 10Volans: Change package name for PyPI [software/spicerack] - 10https://gerrit.wikimedia.org/r/456147 (https://phabricator.wikimedia.org/T199079) [14:14:17] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [14:14:49] (03CR) 10jerkins-bot: [V: 04-1] Change package name for PyPI [software/spicerack] - 10https://gerrit.wikimedia.org/r/456147 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [14:15:47] PROBLEM - superset on thorium is CRITICAL: connect to address 10.64.53.26 and port 9080: Connection refused [14:16:04] ^ icinga puppet running [14:16:19] hashar: No space left on device on n integration-slave-docker-1021 [14:17:26] PROBLEM - Hue Server on thorium is CRITICAL: PROCS CRITICAL: 0 processes with command name python2.7, args /usr/lib/hue/build/env/bin/hue [14:17:53] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work): Onboarding Mathew Onipe - https://phabricator.wikimedia.org/T202708 (10Dzahn) >>! In T202708#4529385, @Gehel wrote: > Some of the checklist items above would make more sense with an @wikimedia.org email (like exim email aliases), so thos... 
[14:18:36] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456149 [14:21:15] (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/456147 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [14:23:28] (03PS1) 10Gilles: Send Thumbor-Request-Id in haproxy response [puppet] - 10https://gerrit.wikimedia.org/r/456151 (https://phabricator.wikimedia.org/T187765) [14:29:42] (03PS1) 10Ayounsi: Revert "Rancid, comment out cr2-eqdfw until pubkey auth issue is solved" [puppet] - 10https://gerrit.wikimedia.org/r/456154 [14:30:29] (03CR) 10Ayounsi: [C: 032] Revert "Rancid, comment out cr2-eqdfw until pubkey auth issue is solved" [puppet] - 10https://gerrit.wikimedia.org/r/456154 (owner: 10Ayounsi) [14:30:42] (03PS1) 10Banyek: mariadb: db2070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456155 [14:30:47] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456149 (owner: 10Marostegui) [14:30:54] (03PS2) 10Ayounsi: Revert "Rancid, comment out cr2-eqdfw until pubkey auth issue is solved" [puppet] - 10https://gerrit.wikimedia.org/r/456154 [14:32:02] (03CR) 10jerkins-bot: [V: 04-1] mariadb: db2070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456155 (owner: 10Banyek) [14:32:40] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456149 (owner: 10Marostegui) [14:33:49] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1091 (duration: 00m 58s) [14:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:15] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456149 (owner: 10Marostegui) [14:36:11] (03PS1) 10Elukey: profile::archiva: allow rsync to bind to IPv6 interfaces [puppet] - 10https://gerrit.wikimedia.org/r/456156 
(https://phabricator.wikimedia.org/T192639) [14:36:33] (03PS2) 10Banyek: mariadb: db2070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456155 [14:37:41] (03CR) 10jerkins-bot: [V: 04-1] mariadb: db2070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456155 (owner: 10Banyek) [14:39:11] 10Operations, 10ops-codfw, 10Discovery, 10Wikidata, and 2 others: add SSDs to wdqs200[12] - https://phabricator.wikimedia.org/T202777 (10Papaul) a:05Papaul>03None @Gehel Disks added to wdqs2001 [14:39:38] so we onboarded everyone with their username as their gerrit login I guess? :) [14:39:48] gerrit/wikitech etc. [14:39:53] (03PS3) 10Banyek: mariadb: db2070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456155 [14:40:17] (03PS2) 10Elukey: profile::archiva: allow rsync to bind to IPv6 interfaces [puppet] - 10https://gerrit.wikimedia.org/r/456156 (https://phabricator.wikimedia.org/T192639) [14:40:18] paravoid: do you prefer that or not like it? [14:41:10] I prefer full names, but I guess that ship has sailed [14:41:20] e.g. all the commits above :) [14:41:43] we can document the preference for the future [14:41:47] the admin module usernames are expected to match the LDAP usernames IIRC [14:42:11] ? [14:42:11] yes, but probably most names on ldap were new [14:42:23] cn != uid [14:42:28] also that [14:43:23] It's possible to modify the cn after the fact and correct? [14:43:36] it is but very complicated IIRC [14:43:57] https://wikitech.wikimedia.org/wiki/LDAP/Renaming_users I think? [14:44:05] yeah [14:44:13] I don't think those docs are complete [14:44:15] it has been done once.. and that was before Phabricator existed [14:44:39] plus all kinds of other stuff like phabricator and logins/stored preferences in various tools [14:44:44] yes [14:44:54] perhaps a task for spicerack? 
;) [14:45:01] +1 [14:47:01] Effie was also interested in that and updated the onboarding docs to point out how it's hard to change later [14:47:09] (03PS1) 10Joal: Update druid datasource in AQS config [puppet] - 10https://gerrit.wikimedia.org/r/456158 [14:47:13] elukey: --^ [14:48:50] (03PS3) 10Elukey: profile::archiva: allow rsync to bind to IPv6 interfaces [puppet] - 10https://gerrit.wikimedia.org/r/456156 (https://phabricator.wikimedia.org/T192639) [14:49:17] (03PS18) 10Alex Monk: [WIP] Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [14:50:24] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [14:52:36] (03PS4) 10Elukey: profile::archiva: allow rsync to bind to IPv6 interfaces [puppet] - 10https://gerrit.wikimedia.org/r/456156 (https://phabricator.wikimedia.org/T192639) [14:58:16] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting Access to view EventLogging data for gabriel-wmde / gbirke - https://phabricator.wikimedia.org/T202072 (10Nuria) Approving access to EL data store [14:58:49] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/12280/archiva1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/456156 (https://phabricator.wikimedia.org/T192639) (owner: 10Elukey) [14:58:57] those docs about renaming also talk about SQL and gerrit.. that is outdated. Gerrit is using notesdb to store users now and there is an open unrelated ticket about issues with duplicate UIDs in Gerrit.. 
so messing with that requires a lot of caution [14:59:24] Yep a lot of caution but I think releng found a way to do it [14:59:30] But requires restarting gerrit [15:00:24] !log T202323 starting operations [15:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:06] paravoid: that renaming page is pre-phabricator IIRC [15:07:32] IMHO it would be ideal to retraoctively update ldap contents after ensuring the proper info will be captured into the appropriate attributes at account creation time. many users have for example the same word with mixed case for uid, cn and sn. on various occasions (and in the future again) it would be useful to for example to look up a user by last name using sn attribute. and adding/populating a gn attribute [15:07:32] would be helpful as well [15:10:24] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting Access to view EventLogging data for gabriel-wmde / gbirke - https://phabricator.wikimedia.org/T202072 (10ArielGlenn) [15:11:39] 10Operations, 10ops-codfw, 10Discovery, 10Wikidata, and 2 others: add SSDs to wdqs200[12] - https://phabricator.wikimedia.org/T202777 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['wdqs2001.codfw.wmnet'] ``` The log can be found in `/var/log/w... [15:15:00] (03CR) 10Alexandros Kosiaris: [C: 031] Change package name for PyPI [software/spicerack] - 10https://gerrit.wikimedia.org/r/456147 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:15:14] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 (10Papaul) @MoritzMuehlenhoff I changed the switch to use the ge-2/0/12 instance of xe-2/0/12 since we are using a 1GB transceiver. the installation is in progress i will let you know whe... 
[15:17:25] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 (10Papaul) @MoritzMuehlenhoff the installation is complete it is all yours [15:17:26] (03PS1) 10ArielGlenn: add Gabriel Birke to shell users [puppet] - 10https://gerrit.wikimedia.org/r/456160 (https://phabricator.wikimedia.org/T202072) [15:24:43] (03CR) 10Dzahn: [C: 031] "key is matching ticket. UID and email are matching LDAP. lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/456160 (https://phabricator.wikimedia.org/T202072) (owner: 10ArielGlenn) [15:26:13] !log joal@deploy1001 Started deploy [analytics/aqs/deploy@c33f6e5]: Update wikistats2 endpoints to use user_text and page_title [15:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:22] (03CR) 10Gehel: [C: 031] "yep, the new name makes sense!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/456147 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:28:01] (03CR) 10ArielGlenn: [C: 032] add Gabriel Birke to shell users [puppet] - 10https://gerrit.wikimedia.org/r/456160 (https://phabricator.wikimedia.org/T202072) (owner: 10ArielGlenn) [15:28:55] (03PS2) 10Elukey: Update druid datasource in AQS config [puppet] - 10https://gerrit.wikimedia.org/r/456158 (owner: 10Joal) [15:32:32] (03PS1) 10ArielGlenn: add Gabriel Birke to analytics-users and researchers groups [puppet] - 10https://gerrit.wikimedia.org/r/456161 (https://phabricator.wikimedia.org/T202072) [15:35:14] (03CR) 10Thcipriani: [C: 031] "thanks for the change!" 
[puppet] - 10https://gerrit.wikimedia.org/r/434427 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [15:36:23] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10User-Addshore: Requesting Access to view EventLogging data for gabriel-wmde / gbirke - https://phabricator.wikimedia.org/T202072 (10ArielGlenn) Before i actually add you to both groups, @gabriel-wmde , do you want to choose (sql or hive) or do you... [15:37:51] !log joal@deploy1001 Finished deploy [analytics/aqs/deploy@c33f6e5]: Update wikistats2 endpoints to use user_text and page_title (duration: 11m 38s) [15:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:22] (03CR) 10Elukey: [C: 032] Update druid datasource in AQS config [puppet] - 10https://gerrit.wikimedia.org/r/456158 (owner: 10Joal) [15:40:08] 10Operations, 10Release-Engineering-Team, 10Epic, 10Services (watching): FY2017/18 Program 6 - Outcome 2 - Objective 2: Set up a continuous integration and deployment pipeline - https://phabricator.wikimedia.org/T170481 (10thcipriani) [15:40:13] 10Operations, 10netops, 10Patch-For-Review: rancid pubkey auth to Junos 17.4 failure - https://phabricator.wikimedia.org/T202952 (10ayounsi) 05Open>03Resolved a:03ayounsi cr2-eqdfw is now being pulled properly by Rancid. Thanks! 
[15:42:04] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Kanban), 10Services (watching): Move Graphoid to Kubernetes via the deployment pipeline - https://phabricator.wikimedia.org/T203091 (10thcipriani) p:05Triage>03Normal [15:42:08] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 (10Papaul) First puppet run complete [15:42:24] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 (10Papaul) [15:44:56] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Kanban), 10Services (watching): Create Graphoid .pipeline files - https://phabricator.wikimedia.org/T203092 (10thcipriani) p:05Triage>03Normal [15:45:03] 10Operations, 10Gerrit, 10Traffic, 10Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10Paladox) @thcipriani or @mmodell wondering if you be able to comment here that releng supports this avatar change and maintaining it please. 
(Ops need releng to comment) [15:47:48] !log T202323 labstore1004 reboot [15:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:00] !log roll restart aqs on aqs100[4-9] to pick up the new druid config settings [15:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:06] (03PS1) 10Gilles: Send blind thumbnail requests to inactive DC [puppet] - 10https://gerrit.wikimedia.org/r/456167 (https://phabricator.wikimedia.org/T201858) [15:51:10] !log imarlier@deploy1001 Started deploy [performance/coal@5d995a3]: (no justification provided) [15:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:16] !log imarlier@deploy1001 Finished deploy [performance/coal@5d995a3]: (no justification provided) (duration: 00m 06s) [15:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:24] !log deploy coal [15:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:46] (03CR) 10jerkins-bot: [V: 04-1] Send blind thumbnail requests to inactive DC [puppet] - 10https://gerrit.wikimedia.org/r/456167 (https://phabricator.wikimedia.org/T201858) (owner: 10Gilles) [15:53:22] !log T202323 labstore1004 now running latest kernel [15:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:52] (03PS1) 10Imarlier: sitemaps: Generalize varnish rule for sitemaps, to apply to all domains [puppet] - 10https://gerrit.wikimedia.org/r/456169 (https://phabricator.wikimedia.org/T198965) [15:58:41] 10Operations, 10ops-codfw, 10Discovery, 10Wikidata, and 2 others: add SSDs to wdqs200[12] - https://phabricator.wikimedia.org/T202777 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['wdqs2001.codfw.wmnet'] ``` The log can be found in `/var/log/w... 
[15:59:32] (03PS2) 10Gilles: Send blind thumbnail requests to inactive DC [puppet] - 10https://gerrit.wikimedia.org/r/456167 (https://phabricator.wikimedia.org/T201858) [15:59:45] (03CR) 10Ottomata: [C: 032] Deploy wikistats from master branch [puppet] - 10https://gerrit.wikimedia.org/r/455892 (https://phabricator.wikimedia.org/T203017) (owner: 10Nuria) [15:59:51] (03PS4) 10Ottomata: Deploy wikistats from master branch [puppet] - 10https://gerrit.wikimedia.org/r/455892 (https://phabricator.wikimedia.org/T203017) (owner: 10Nuria) [15:59:57] (03CR) 10Ottomata: [V: 032 C: 032] Deploy wikistats from master branch [puppet] - 10https://gerrit.wikimedia.org/r/455892 (https://phabricator.wikimedia.org/T203017) (owner: 10Nuria) [16:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do Morning SWAT (Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180829T1600). [16:00:04] Aleksey_WMDE: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:10] Here! [16:00:25] Who is SWATting? [16:00:37] (03PS1) 10Smalyshev: Enable dailies everywhere [puppet] - 10https://gerrit.wikimedia.org/r/456170 (https://phabricator.wikimedia.org/T201217) [16:01:11] (03CR) 10Volans: [C: 032] Change package name for PyPI [software/spicerack] - 10https://gerrit.wikimedia.org/r/456147 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [16:02:48] !log T202323 failover drbd primary from labstore1005 to labstore1004 [16:03:57] Anyone? [16:05:10] tgr: ^? It seems no-SWAT deployer [16:05:27] arounds [16:06:43] I can deploy, sure [16:06:51] Awesome! [16:07:36] thx [16:11:57] Aleksey_WMDE: you can test on mwdebug1002 [16:12:12] Got it. 
Give me 5 minutes [16:12:58] !log T202323 labstore1005 reboot [16:13:10] no stash bot :P [16:16:45] marostegui: can I get a hand with a db issue? I want to resize a VARCHAR field from 64 to 256 but am getting an error... [16:16:49] https://www.irccloud.com/pastebin/pOILFTOw/ [16:17:26] tgr: All seems to be good [16:17:35] (03Merged) 10jenkins-bot: Change package name for PyPI [software/spicerack] - 10https://gerrit.wikimedia.org/r/456147 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [16:17:35] (03PS2) 10Aleksey Bekh-Ivanov (WMDE): Wikidata: Use new item ID formatter for Q1-Q10000 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456103 (https://phabricator.wikimedia.org/T201834) [16:17:38] (03CR) 10Imarlier: [C: 031] admin: add perf-team to webserver_misc_static [puppet] - 10https://gerrit.wikimedia.org/r/455602 (https://phabricator.wikimedia.org/T202910) (owner: 10Dzahn) [16:18:04] andrewbogott: soo, there is a short answer and a long one [16:18:10] which one you want? :D [16:18:26] If there's an answer of the form 'just do X and never think about it again' then I want that one :) [16:18:40] 10Operations, 10ops-codfw, 10Discovery, 10Wikidata, and 2 others: add SSDs to wdqs200[12] - https://phabricator.wikimedia.org/T202777 (10Gehel) error during reimage of wdqs2001: ``` ┌────────────────────┤ [!!] Partition disks ├─────────────────────┐ │... [16:18:47] which db? how much control you have over it? [16:19:04] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review: rack/setup/install analyticsmaster100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T201939 (10Cmjohnson) @elukey they are analyticsmaster100x I think during a discussion with ottomata we agreed to remove the hyphen. They are racked in 2 d... 
[16:19:14] It's on m5 — it's in active use and full of data so I can't really start from scratch [16:19:50] ok then there are 2 options, let me explain the error first [16:20:05] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:456103|Wikidata: Use new item ID formatter for Q1-Q10000 (T201834)]] (duration: 00m 56s) [16:20:10] (03PS1) 10RobH: fixing analyticsmaster1002 mgmt dns entry [dns] - 10https://gerrit.wikimedia.org/r/456173 [16:20:11] give me 3 min, tel [16:20:14] Aleksey_WMDE: it's live [16:20:24] Thanks. Will test more [16:20:42] (03PS1) 10Andrew Bogott: VPS puppet ENC: change max prefix size to 255 [puppet] - 10https://gerrit.wikimedia.org/r/456174 [16:20:50] volans: the context is ^ [16:20:52] (03CR) 10RobH: [C: 032] fixing analyticsmaster1002 mgmt dns entry [dns] - 10https://gerrit.wikimedia.org/r/456173 (owner: 10RobH) [16:21:06] 10Operations, 10Performance-Team, 10Wikidata, 10Wikidata-Query-Service, 10User-Addshore: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (10ArielGlenn) p:05Triage>03High [16:21:13] (I'm actually interested in the long answer but only after the issue is resolved) [16:21:29] !log T202323 labstore1005 reboot (by arturo) [16:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:56] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:456103|Wikidata: Use new item ID formatter for Q1-Q10000 (T201834)]] (duration: 00m 56s) (by logmsgbot) [16:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:01] T201834: Use link formatter that uses cache instead of wb_terms for items Q1-Q10.000 - https://phabricator.wikimedia.org/T201834 [16:22:12] thanks rxy [16:22:16] :) [16:22:18] 10Operations: Support for QLogic FastLinQ 41112 Dual Port 10Gb SFP+ Adapter - https://phabricator.wikimedia.org/T202255 (10ArielGlenn) p:05Triage>03High [16:22:25] 
10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review: rack/setup/install analyticsmaster100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T201939 (10Ottomata) @Cmjohnson Rob's latest comment is what I thought we agreed to: > The solution is we'll order newer, longer hostname labels for those... [16:25:19] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review: rack/setup/install analyticsmaster100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T201939 (10RobH) I'll handle the dns changes to change the hostname, the label won't change for now since the longer name won't fit. Chris: Go ahead and fin... [16:28:52] andrewbogott: back [16:29:53] so, the problem is that you have an index on that table, and that with the collation used it requires more space than allowed for an index [16:30:16] yep, makes sense. And I assume that varchar(1) is more than one byte, thanks to unicode? [16:30:16] (03PS1) 10Dzahn: mediawiki::maintenance: use mw_primary to enable/disable crons [puppet] - 10https://gerrit.wikimedia.org/r/456175 [16:30:35] andrewbogott: I think that can be resolve by this : ALTER TABLE `prefix` ROW_FORMAT=DYNAMIC; but I don't know that is ok for WMF sites. 
[16:30:37] one way to solve it is to have innodb_file_format = barracuda, innodb_large_prefix = 1 and alter the table to use ROW_FORMAT=DYNAMIC [16:30:47] we've done that on debmonitor's db with the DBAs [16:31:00] the other quick solution is to make it shorter, either the field or the index only [16:31:15] (03PS2) 10Dzahn: mediawiki::maintenance: use mw_primary to enable/disable crons [puppet] - 10https://gerrit.wikimedia.org/r/456175 (https://phabricator.wikimedia.org/T199073) [16:31:42] depends if you're using it as a primary/unique key or not (in that case the shorter index is not viable) [16:32:08] volans: 255 is the correct size but it will be very unusual for the actual field to exceed 64 (as far as I know today is the first time that's happened) [16:32:51] VARCHAR 255 = utf8mb * 4 * = 255 * 4 = 1020 [16:32:53] andrewbogott: if you're blocked, changed it to 191 (horrible size) [16:32:59] that table has three fields, id/poject/prefix [16:33:04] and then talk with the DBA for the proper fix [16:33:12] combination of project+prefix should be unique but prefix itself not [16:33:21] * volans logging to check the table schema [16:35:22] andrewbogott: so, the max you can set as is is 127 if I'm not mistaken [16:35:46] this is if you're blocked [16:35:49] that would fix the problem for pretty much ever, despite not being technically correct. [16:35:55] if you're not, open a task with DBAs for a proper fix [16:36:13] It's not an emergency so I'll open a task [16:36:17] thank you for explaining! [16:36:53] andrewbogott: there is ofc another solution [16:37:04] Which is not have an index on that field? [16:37:06] are project and prefix names ever support not-ascii? [16:37:15] heh, I don't know but maybe. 
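[note] The sizes quoted in the exchange above (191, 127, 1020) can be checked with a short calculation. This is only a sketch of the arithmetic, not anything run against the m5 database. It assumes a 767-byte index key budget (the InnoDB key prefix limit when large prefixes are not enabled) and 4 bytes per character as the utf8mb4 worst case. It also follows the channel in treating that budget as shared by `project` and `prefix`; whether the limit applies per column or to the whole key depends on the storage engine and row format, so that part is an assumption. The function names are made up for illustration.

```python
# Illustrative arithmetic only; the constants are assumptions, not values
# read from the database discussed in the channel.
INDEX_BUDGET_BYTES = 767  # assumed InnoDB key prefix limit without large prefixes
BYTES_PER_CHAR = 4        # worst case for one utf8mb4 character


def max_indexable_chars(budget_bytes: int, bytes_per_char: int) -> int:
    """Largest VARCHAR length whose worst-case size fits in the budget."""
    return budget_bytes // bytes_per_char


def index_bytes(*column_lengths: int, bytes_per_char: int = BYTES_PER_CHAR) -> int:
    """Worst-case bytes needed to index columns of the given lengths."""
    return sum(column_lengths) * bytes_per_char


single_column_max = max_indexable_chars(INDEX_BUDGET_BYTES, BYTES_PER_CHAR)
# Figure used in the channel: what remains once `project` varchar(64) is counted.
prefix_max_with_project = single_column_max - 64

print(single_column_max)        # 191
print(prefix_max_with_project)  # 127
print(index_bytes(255))         # 1020, the VARCHAR(255) case that fails
print(index_bytes(64, 127))     # 764, just under the assumed 767-byte budget
```

Under these assumptions, 127 is the largest `prefix` length that keeps the combined key within budget, which matches the stopgap value suggested in the channel.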
[16:37:34] otherwise convert to ascii is another option, but you know the data you hold there :D [16:37:36] I mean, dns is unicode so probably we should try to be too [16:38:10] assuming openstack supports it all the way :D [16:38:46] PROBLEM - MariaDB Slave Lag: m3 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.99 seconds [16:39:16] volans: I think 191 is max len ... 191*4 = 764 , 192 *4 = 768 [16:39:30] rxy: yes, but they have UNIQUE KEY `project_prefix` (`project`,`prefix`) [16:39:37] adn `project` varchar(64) [16:39:46] ah, k [16:40:11] you couldn't know ;) [16:40:19] I just logged in there to check it [16:40:48] 191 - 64 = 127 [16:41:48] (03PS2) 10Andrew Bogott: VPS puppet ENC: change max prefix size to 255 [puppet] - 10https://gerrit.wikimedia.org/r/456174 [16:42:20] (03CR) 10jerkins-bot: [V: 04-1] VPS puppet ENC: change max prefix size to 255 [puppet] - 10https://gerrit.wikimedia.org/r/456174 (owner: 10Andrew Bogott) [16:42:34] * andrewbogott logs T203104 for now [16:43:15] (03PS3) 10Andrew Bogott: VPS puppet ENC: change max prefix size to 255 [puppet] - 10https://gerrit.wikimedia.org/r/456174 (https://phabricator.wikimedia.org/T203104) [16:45:25] PROBLEM - MariaDB Slave Lag: m3 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 329.98 seconds [16:45:35] PROBLEM - MariaDB Slave Lag: m3 on db2042 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 335.53 seconds [16:48:25] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review: rack/setup/install analyticsmaster100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T201939 (10Cmjohnson) [16:48:43] db2042 is lagging because of the BBU [16:48:47] I will force it to be WB [16:49:11] !log Force RAID controller to WB policy T202051 [16:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:16] T202051: db2042 RAID battery failed - https://phabricator.wikimedia.org/T202051 [16:49:51] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review: 
rack/setup/install analyticsmaster100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T201939 (10Cmjohnson) @robh network ports are setup and in analytics vlan A6 ge-6/0/15 up up analytics-master1001 B8 ge-8/0/21 up up... [16:50:48] 10Operations, 10SRE-Access-Requests, 10Performance-Team (Radar): add perf-team admins to releases servers (was: webserver misc static servers) - https://phabricator.wikimedia.org/T202910 (10BBlack) On the whole `releases` vs `microsites` bit: * Both are behind varnish so they work for @Imarlier 's first cons... [16:50:55] 10Operations, 10ops-codfw, 10DBA: db2042 RAID battery failed - https://phabricator.wikimedia.org/T202051 (10Marostegui) I have forced db2042 to be WB again as it was lagging too much behind: ``` 16:45 < icinga-wm> PROBLEM - MariaDB Slave Lag: m3 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag:... [16:52:52] Sorry andrewbogott - I can take a look next week I am supposed to be on holidays now :-) [16:52:56] RECOVERY - MariaDB Slave Lag: m3 on db2078 is OK: OK slave_sql_lag Replication lag: 0.21 seconds [16:53:06] RECOVERY - MariaDB Slave Lag: m3 on db2042 is OK: OK slave_sql_lag Replication lag: 0.35 seconds [16:53:07] (03CR) 10Volans: "couple of comments inline, I see eventlet was already used, so not commenting on it." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/456167 (https://phabricator.wikimedia.org/T201858) (owner: 10Gilles) [16:53:13] 10Operations, 10ops-eqiad: rack/setup/install cloudservices1004.wikimedia.org - https://phabricator.wikimedia.org/T201341 (10RobH) So, this system is plugged into asw2-a-eqiad, which had deployment issues. @Cmjohnson will need to move this to asw-a-eqiad and update this task with the port. (It shows as alloc... [16:54:07] marostegui: sorry for the ping. I put a 'temporary' solution in place that should work for the next 5 years or so :) [16:54:42] andrewbogott: haha not bad then! Will take a look next week if no one does before! 
:) [16:59:13] (03PS1) 10Bstorm: labstore: set labstore1004 as the new primary [puppet] - 10https://gerrit.wikimedia.org/r/456177 (https://phabricator.wikimedia.org/T202323) [17:03:31] 10Operations, 10ops-codfw, 10Discovery, 10Wikidata, and 2 others: add SSDs to wdqs200[12] - https://phabricator.wikimedia.org/T202777 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['wdqs2001.codfw.wmnet'] ``` The log can be found in `/var/log/w... [17:04:55] PROBLEM - Disk space on elastic1024 is CRITICAL: DISK CRITICAL - free space: /srv 51369 MB (10% inode=99%) [17:08:45] (03CR) 10Bstorm: [C: 032] labstore: set labstore1004 as the new primary [puppet] - 10https://gerrit.wikimedia.org/r/456177 (https://phabricator.wikimedia.org/T202323) (owner: 10Bstorm) [17:09:26] (03PS1) 10RobH: cloudservices1004 production dns entries [dns] - 10https://gerrit.wikimedia.org/r/456179 (https://phabricator.wikimedia.org/T201341) [17:10:03] (03CR) 10RobH: [C: 032] cloudservices1004 production dns entries [dns] - 10https://gerrit.wikimedia.org/r/456179 (https://phabricator.wikimedia.org/T201341) (owner: 10RobH) [17:11:36] 10Operations, 10ops-eqiad: rack/setup/install cloudservices1004.wikimedia.org - https://phabricator.wikimedia.org/T201341 (10RobH) a:05Cmjohnson>03RobH [17:12:16] PROBLEM - High lag on wdqs1004 is CRITICAL: 3602 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [17:12:43] 10Operations, 10ops-ulsfo, 10Traffic, 10netops: ulsfo migration tracking - https://phabricator.wikimedia.org/T202433 (10faidon) Phew, that's a lot! So I think: - On Equinix, no LoA, but there was a thread with DR where they mentioned that they tracked it down and can hotcut it. @RobH was Cc'ed in that thre... 
[17:12:53] * gehel is looking at wdqs1004 [17:13:26] RECOVERY - Disk space on elastic1024 is OK: DISK OK [17:14:24] gehel: wdq4 is ok, I'm reloading categories there so it should be fine soon I think [17:14:36] SMalyshev: ok, I'll ack it [17:14:57] yea, if it's not fine in ~30 mins then we might want to check up again [17:28:33] 10Operations, 10ops-codfw, 10Discovery, 10Wikidata, and 2 others: add SSDs to wdqs200[12] - https://phabricator.wikimedia.org/T202777 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['wdqs2001.codfw.wmnet'] ``` and were **ALL** successful. [17:29:42] (03CR) 10Dzahn: [C: 031] "{" [puppet] - 10https://gerrit.wikimedia.org/r/455369 (https://phabricator.wikimedia.org/T202819) (owner: 10Reedy) [17:32:07] (03PS1) 10RobH: cloudservices1004 mac address update [puppet] - 10https://gerrit.wikimedia.org/r/456183 (https://phabricator.wikimedia.org/T201341) [17:32:40] (03CR) 10RobH: [C: 032] cloudservices1004 mac address update [puppet] - 10https://gerrit.wikimedia.org/r/456183 (https://phabricator.wikimedia.org/T201341) (owner: 10RobH) [17:34:23] (03PS1) 10Cmjohnson: Removing puppet entries for decom host silver [puppet] - 10https://gerrit.wikimedia.org/r/456184 (https://phabricator.wikimedia.org/T191357) [17:34:46] (03CR) 10Dzahn: [C: 031] "> "title": "/etc/apache2/sites-available/fixcopyright.wikimedia.org.conf"," [puppet] - 10https://gerrit.wikimedia.org/r/455369 (https://phabricator.wikimedia.org/T202819) (owner: 10Reedy) [17:35:43] (03PS2) 10Cmjohnson: Removing puppet entries for decom host silver [puppet] - 10https://gerrit.wikimedia.org/r/456184 (https://phabricator.wikimedia.org/T191357) [17:36:36] (03CR) 10Cmjohnson: [C: 032] Removing puppet entries for decom host silver [puppet] - 10https://gerrit.wikimedia.org/r/456184 (https://phabricator.wikimedia.org/T191357) (owner: 10Cmjohnson) [17:41:06] (03PS1) 10Andrew Bogott: Openstack glance: make a /a symlink [puppet] - 10https://gerrit.wikimedia.org/r/456185 [17:41:48] 10Operations, 
10ops-eqiad, 10DC-Ops, 10cloud-services-team: labvirt1003 raid warning - https://phabricator.wikimedia.org/T200203 (10Bstorm) 05Open>03Resolved Looks happy now. ``` Smart Array P420i in Slot 0 (Embedded) array B Logical Drive: 2 Size: 2.2 TB Fault Tolerance:... [17:44:05] (03CR) 10Andrew Bogott: [C: 032] Openstack glance: make a /a symlink [puppet] - 10https://gerrit.wikimedia.org/r/456185 (owner: 10Andrew Bogott) [17:45:24] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, and 2 others: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Krinkle) @Imarlier Hm.. might be unrelated, but I see that those are all m-d... [17:47:22] (03CR) 10Dzahn: [C: 031] "looking at the excerpts from the full catalog on compiler above.. it will create the new site in sites-available but not symlink it to sit" [puppet] - 10https://gerrit.wikimedia.org/r/455369 (https://phabricator.wikimedia.org/T202819) (owner: 10Reedy) [17:51:28] !log disabling puppet on mw hosts as a precaution before deploying apache change [17:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:41] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, and 2 others: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Imarlier) >>! In T199252#4543042, @Krinkle wrote: > @Imarlier Hm.. might be... 
[17:51:48] !log imarlier@deploy1001 Started deploy [performance/coal@7457c86]: (no justification provided) [17:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:54] !log imarlier@deploy1001 Finished deploy [performance/coal@7457c86]: (no justification provided) (duration: 00m 06s) [17:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:59] (03PS4) 10Dzahn: Add fixcopyright.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/455369 (https://phabricator.wikimedia.org/T202819) (owner: 10Reedy) [17:55:43] (03CR) 10Dzahn: [C: 032] Add fixcopyright.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/455369 (https://phabricator.wikimedia.org/T202819) (owner: 10Reedy) [17:58:00] (03PS1) 10Cmjohnson: Removing production dns for decom host silver [dns] - 10https://gerrit.wikimedia.org/r/456186 (https://phabricator.wikimedia.org/T191357) [17:58:49] (03PS2) 10Cmjohnson: Removing production dns for decom host silver [dns] - 10https://gerrit.wikimedia.org/r/456186 (https://phabricator.wikimedia.org/T191357) [17:59:12] (03CR) 10Cmjohnson: [C: 032] Removing production dns for decom host silver [dns] - 10https://gerrit.wikimedia.org/r/456186 (https://phabricator.wikimedia.org/T191357) (owner: 10Cmjohnson) [17:59:51] (03PS1) 10Smalyshev: Add health check for categories endpoint without lag check [puppet] - 10https://gerrit.wikimedia.org/r/456187 [17:59:53] (03CR) 10Jcrespo: [C: 031] "+1 to the idea of doing this for short term (switch), have etc in the long term, but I have not checked thoroughly for implementation corr" [puppet] - 10https://gerrit.wikimedia.org/r/456175 (https://phabricator.wikimedia.org/T199073) (owner: 10Dzahn) [18:00:02] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decom silver/WMF3434 - https://phabricator.wikimedia.org/T191357 (10Cmjohnson) [18:00:30] (03CR) 10jerkins-bot: [V: 04-1] Add health check for categories endpoint without lag check 
[puppet] - 10https://gerrit.wikimedia.org/r/456187 (owner: 10Smalyshev) [18:01:43] (03PS2) 10Smalyshev: Add health check for categories endpoint without lag check [puppet] - 10https://gerrit.wikimedia.org/r/456187 [18:02:10] 10Operations, 10ops-eqiad, 10decommission, 10netops: unrack/decom pfw1-eqiad and pfw2-eqiad - https://phabricator.wikimedia.org/T183390 (10Cmjohnson) [18:04:43] RECOVERY - High lag on wdqs1004 is OK: (C)3600 ge (W)1200 ge 1008 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [18:07:14] 10Operations, 10ops-eqiad, 10decommission, 10netops: unrack/decom pfw1-eqiad and pfw2-eqiad - https://phabricator.wikimedia.org/T183390 (10Cmjohnson) [18:07:50] 10Operations, 10ops-eqiad, 10decommission, 10netops: unrack/decom pfw1-eqiad and pfw2-eqiad - https://phabricator.wikimedia.org/T183390 (10Cmjohnson) 05Open>03Resolved These are off the racks, zeroized, the scs ports were re-used with the new frack switches and the port descriptions were updated. [18:08:47] !log new apache config in sites-available for new site fixcopyright.wm is being generated by puppet on cluster, but not enabled yet (T202819) [18:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:52] T202819: Create production wiki: fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T202819 [18:10:43] !log puppet re-enabled on mw* via cumin before last log message (cumin expected to log to SAL automatically?) 
[18:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:19] (03PS1) 10Dzahn: beta: add fixcopyright.wm to Apache sites/wikimedia.conf [puppet] - 10https://gerrit.wikimedia.org/r/456192 (https://phabricator.wikimedia.org/T202819) [18:16:51] (03CR) 10Dzahn: "needs DNS record but how" [puppet] - 10https://gerrit.wikimedia.org/r/456192 (https://phabricator.wikimedia.org/T202819) (owner: 10Dzahn) [18:21:19] (03PS1) 10Dzahn: apache/mediawiki: include new "other wiki" fixcopyright in cluster config [puppet] - 10https://gerrit.wikimedia.org/r/456194 (https://phabricator.wikimedia.org/T202819) [18:22:19] (03PS1) 10Andrew Bogott: region-migrate: rearrange some things [puppet] - 10https://gerrit.wikimedia.org/r/456195 [18:27:24] (03PS1) 10Dzahn: apache::wikimedia: replace lone bugzilla reference with Phab [puppet] - 10https://gerrit.wikimedia.org/r/456197 [18:27:55] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install cloudservices1004.wikimedia.org - https://phabricator.wikimedia.org/T201341 (10RobH) Ok, firmware updated on the bios and the network card, as they were outdated. Other firmware versions are up to date (according to support.dell.com for the s... [18:28:12] 10Operations, 10ops-eqiad: rack/setup/install cloudservices1004.wikimedia.org - https://phabricator.wikimedia.org/T201341 (10RobH) [18:28:57] 10Operations, 10Scap, 10Patch-For-Review: Intermittent git-fat failure during deploy - https://phabricator.wikimedia.org/T202100 (10thcipriani) >>! In T202100#4515951, @Ottomata wrote: > we just updated it for wqds* hosts. If that worked fine for @Gehel and Erik, we'll update the rest of the flee (all nodes... 
[18:31:10] 10Operations: rack/setup/install cloudservices1004.wikimedia.org - https://phabricator.wikimedia.org/T201341 (10RobH) a:05RobH>03Andrew Ok, the first puppet run fails due to: Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a F... [18:31:28] !log debdeploy git-fat update for all nodes - T202100 [18:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:33] T202100: Intermittent git-fat failure during deploy - https://phabricator.wikimedia.org/T202100 [18:31:53] 10Operations, 10Release-Engineering-Team, 10Scap: find a way to systematically update the deployment server name across all repos - https://phabricator.wikimedia.org/T197470 (10thcipriani) [18:32:05] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission radon - https://phabricator.wikimedia.org/T202040 (10Cmjohnson) @Vgutierrez Not sure if this is you but before I complete the decom process for this I see these smokeping entries in puppet. modules/smokeping/files/config.d/Targets:+++ ra... [18:32:37] ottomata: awesome! thanks! [18:38:28] 10Operations, 10Scap, 10Patch-For-Review: Intermittent git-fat failure during deploy - https://phabricator.wikimedia.org/T202100 (10Ottomata) Done, worked everywhere except: ``` The following hosts were unreachable: cloudservices1004.wikimedia.org ``` [18:39:56] 10Operations, 10Scap, 10Patch-For-Review: Intermittent git-fat failure during deploy - https://phabricator.wikimedia.org/T202100 (10Dzahn) >>! 
In T202100#4543309, @Ottomata wrote: > The following hosts were unreachable: > cloudservices1004.wikimedia.org should be because of T201341#4543268 [18:46:53] (03CR) 10Dzahn: [C: 032] apache/mediawiki: include new "other wiki" fixcopyright in cluster config [puppet] - 10https://gerrit.wikimedia.org/r/456194 (https://phabricator.wikimedia.org/T202819) (owner: 10Dzahn) [18:47:24] (03CR) 10Dzahn: [C: 032] "tested and deploying on mwdebug first" [puppet] - 10https://gerrit.wikimedia.org/r/456194 (https://phabricator.wikimedia.org/T202819) (owner: 10Dzahn) [18:49:51] (03CR) 10Dzahn: [C: 032] "comments only" [puppet] - 10https://gerrit.wikimedia.org/r/456197 (owner: 10Dzahn) [18:51:56] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Kanban), 10Services (watching): Move Graphoid to Kubernetes via the deployment pipeline - https://phabricator.wikimedia.org/T203091 (10thcipriani) [18:55:01] !log puppet deploy of new cluster apache site inclusion for fixcopyright.wm, tested on mwdebug100*,mw1269, apache-fast-test (T202819) (gerrit:456194) [18:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:07] T202819: Create production wiki: fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T202819 [18:57:17] (03CR) 10Dzahn: [C: 032] "apache-fast-test diff:" [puppet] - 10https://gerrit.wikimedia.org/r/456194 (https://phabricator.wikimedia.org/T202819) (owner: 10Dzahn) [18:58:15] Reedy: ^ createwiki should be unblocked in 30 min [19:00:04] marxarelli: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Americas version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180829T1900). [19:04:54] wee [19:11:53] PROBLEM - Check systemd state on notebook1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[19:14:12] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational [19:17:40] 10Operations, 10Beta-Cluster-Infrastructure, 10Jenkins, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Upgrade deployment-prep deployment servers to stretch - https://phabricator.wikimedia.org/T192561 (10thcipriani) >>! In T192561#4540050, @Dzahn wrote: > Thanks, i see: > > ``` > # Parsoid Jav... [19:18:21] 10Operations, 10SRE-Access-Requests, 10Performance-Team (Radar): add perf-team admins to releases servers (was: webserver misc static servers) - https://phabricator.wikimedia.org/T202910 (10Dzahn) Thanks for the detailed answer, bblack. Alright, let's go with the easier solution of not moving the site anoth... [19:19:17] (03PS1) 10Dduvall: group1 to 1.32.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456202 [19:19:29] (03CR) 10Andrew Bogott: [C: 032] region-migrate: rearrange some things [puppet] - 10https://gerrit.wikimedia.org/r/456195 (owner: 10Andrew Bogott) [19:19:36] (03PS2) 10Andrew Bogott: region-migrate: rearrange some things [puppet] - 10https://gerrit.wikimedia.org/r/456195 [19:20:58] (03CR) 10Dduvall: [C: 032] group1 to 1.32.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456202 (owner: 10Dduvall) [19:21:36] !log Deploying 1.32.0-wmf.19 to group1 [19:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:21] (03Merged) 10jenkins-bot: group1 to 1.32.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456202 (owner: 10Dduvall) [19:24:05] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group1 to 1.32.0-wmf.19 [19:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:22] (03CR) 10jenkins-bot: group1 to 1.32.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456202 (owner: 10Dduvall) [19:28:00] (03PS1) 10Krinkle: noc: Add Cache-Control with short max-age for noc.wikimedia.org [puppet] 
- 10https://gerrit.wikimedia.org/r/456206 (https://phabricator.wikimedia.org/T202734) [19:29:54] (03PS1) 10Andrew Bogott: Add DNS entries for labs-ns2 and labs-recursor2 [dns] - 10https://gerrit.wikimedia.org/r/456207 (https://phabricator.wikimedia.org/T201341) [19:30:08] (03CR) 10jerkins-bot: [V: 04-1] Add DNS entries for labs-ns2 and labs-recursor2 [dns] - 10https://gerrit.wikimedia.org/r/456207 (https://phabricator.wikimedia.org/T201341) (owner: 10Andrew Bogott) [19:31:46] (03PS1) 10Andrew Bogott: eqiad1: add entries for cloudservices1004 [puppet] - 10https://gerrit.wikimedia.org/r/456208 (https://phabricator.wikimedia.org/T201341) [19:33:02] (03PS2) 10Andrew Bogott: Add DNS entries for labs-ns3 and labs-recursor3 [dns] - 10https://gerrit.wikimedia.org/r/456207 (https://phabricator.wikimedia.org/T201341) [19:33:50] (03CR) 10Andrew Bogott: [C: 032] Add DNS entries for labs-ns3 and labs-recursor3 [dns] - 10https://gerrit.wikimedia.org/r/456207 (https://phabricator.wikimedia.org/T201341) (owner: 10Andrew Bogott) [19:34:40] (03CR) 10Andrew Bogott: [C: 032] eqiad1: add entries for cloudservices1004 [puppet] - 10https://gerrit.wikimedia.org/r/456208 (https://phabricator.wikimedia.org/T201341) (owner: 10Andrew Bogott) [19:43:37] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labvirt1009 HP Raid alert - https://phabricator.wikimedia.org/T198479 (10Bstorm) @Cmjohnson Do the new spares we got fit this machine? 
[19:47:55] !log failed tools and cloud vps storage services over to labstore1004 [19:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:33] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:49:44] PROBLEM - High load average on labstore1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [24.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [19:54:13] RECOVERY - High load average on labstore1004 is OK: OK: Less than 50.00% above the threshold [16.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [19:57:13] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [20:00:01] 10Operations, 10Patch-For-Review: rack/setup/install cloudservices1004.wikimedia.org - https://phabricator.wikimedia.org/T201341 (10Andrew) 05Open>03Resolved puppet is running now. Thank you! [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: That opportune time is upon us again. Time for a Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180829T2000). 
[20:01:04] (03CR) 10BryanDavis: [C: 031] noc: Add Cache-Control with short max-age for noc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/456206 (https://phabricator.wikimedia.org/T202734) (owner: 10Krinkle) [20:06:21] PROBLEM - Auth DNS on cloudservices1004 is CRITICAL: CRITICAL - Plugin timed out while executing system call [20:08:10] PROBLEM - Check for gridmaster host resolution TCP on cloudservices1004 is CRITICAL: CRITICAL - Plugin timed out while executing system call [20:09:51] PROBLEM - Check for gridmaster host resolution UDP on cloudservices1004 is CRITICAL: CRITICAL - Plugin timed out while executing system call [20:11:40] PROBLEM - Recursive DNS on 208.80.154.24 is CRITICAL: CRITICAL - Plugin timed out while executing system call [20:13:21] PROBLEM - Check systemd state on cloudservices1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:16:13] 10Puppet, 10Phabricator: Local config file contains escape characters - https://phabricator.wikimedia.org/T103924 (10Aklapper) [20:16:38] 10Puppet, 10Phabricator: Local config file contains escape characters - https://phabricator.wikimedia.org/T103924 (10Aklapper) Where to see that "they get processed as regular string"? [20:17:49] (03PS1) 10Andrew Bogott: cloud pdns on Jessie: use $::fqdn rather than $host for db_host [puppet] - 10https://gerrit.wikimedia.org/r/456213 [20:24:15] (03PS1) 10Andrew Bogott: Added dummy pdns passwords for eqiad1. [labs/private] - 10https://gerrit.wikimedia.org/r/456214 [20:24:23] (03CR) 10Andrew Bogott: [V: 032 C: 032] Added dummy pdns passwords for eqiad1. 
[labs/private] - 10https://gerrit.wikimedia.org/r/456214 (owner: 10Andrew Bogott) [20:28:17] (03CR) 10Andrew Bogott: [C: 032] cloud pdns on Jessie: use $::fqdn rather than $host for db_host [puppet] - 10https://gerrit.wikimedia.org/r/456213 (owner: 10Andrew Bogott) [20:29:31] PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:29:46] PROBLEM - drbd service on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit drbd is failed [20:30:11] PROBLEM - Check systemd state on labstore1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:30:47] PROBLEM - drbd service on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit drbd is failed [20:32:45] bstorm_: are those false positives from failing over, or real alerts? [20:32:51] PROBLEM - Check for gridmaster host resolution TCP on cloudservices1004 is CRITICAL: DNS CRITICAL - 0.016 seconds response time (No ANSWER SECTION found) [20:33:22] Those are real. DRBD wasn't actually synced up following the failovers. I just clued the monitor into it by kicking something. [20:33:36] (03PS1) 10Ottomata: Initial debian packaging version 0.208 [debs/presto] (debian) - 10https://gerrit.wikimedia.org/r/456277 (https://phabricator.wikimedia.org/T203115) [20:33:50] (03CR) 10Subramanya Sastry: "Before merge, do make sure parser migration will not break since (at least) enwiki is still actively fixing pages." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443645 (https://phabricator.wikimedia.org/T175706) (owner: 10C.
Scott Ananian) [20:35:01] PROBLEM - Check for gridmaster host resolution TCP on cloudservices1004 is CRITICAL: DNS CRITICAL - 0.011 seconds response time (No ANSWER SECTION found) [20:36:27] gotcha ok [20:39:20] PROBLEM - Check for gridmaster host resolution TCP on cloudservices1004 is CRITICAL: DNS CRITICAL - 0.010 seconds response time (No ANSWER SECTION found) [20:41:02] (03PS2) 10Ottomata: Initial debian packaging version 0.208 [debs/presto] (debian) - 10https://gerrit.wikimedia.org/r/456277 (https://phabricator.wikimedia.org/T203115) [20:41:46] (03PS1) 10Dzahn: admins: create new group sitemap-admins for sitemap uploaders [puppet] - 10https://gerrit.wikimedia.org/r/456279 (https://phabricator.wikimedia.org/T202910) [20:42:30] (03CR) 10jerkins-bot: [V: 04-1] admins: create new group sitemap-admins for sitemap uploaders [puppet] - 10https://gerrit.wikimedia.org/r/456279 (https://phabricator.wikimedia.org/T202910) (owner: 10Dzahn) [20:44:01] RECOVERY - Auth DNS on cloudservices1004 is OK: DNS OK: 0.029 seconds response time. labs-ns3.wikimedia.org returns [20:44:05] (03PS1) 10BryanDavis: toolforge: Forward security@tools.wmflabs.org to security@wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/456280 (https://phabricator.wikimedia.org/T182812) [20:44:41] PROBLEM - Check for gridmaster host resolution TCP on cloudservices1004 is CRITICAL: DNS CRITICAL - 0.011 seconds response time (No ANSWER SECTION found) [20:45:35] (03PS4) 10Dzahn: admin: add sitemap-admins to webserver_misc_static [puppet] - 10https://gerrit.wikimedia.org/r/455602 (https://phabricator.wikimedia.org/T202910) [20:47:46] (03PS2) 10Dzahn: admins: create new group sitemap-admins for sitemap uploaders [puppet] - 10https://gerrit.wikimedia.org/r/456279 (https://phabricator.wikimedia.org/T202910) [20:48:29] 10Operations, 10Mail, 10Toolforge, 10Patch-For-Review, 10Security: Forward security@tools.wmflabs.org to security@wikimedia.org - https://phabricator.wikimedia.org/T182812 (10bd808) >>! 
In T182812#3835406, @faidon wrote: > tools.wmflabs.org isn't a relay that is in production, so it is not (and cannot be... [20:49:26] (03PS3) 10Dzahn: admins: create new group sitemaps-admins for sitemaps uploaders [puppet] - 10https://gerrit.wikimedia.org/r/456279 (https://phabricator.wikimedia.org/T202910) [20:51:54] (03CR) 10Dzahn: [C: 032] admins: create new group sitemaps-admins for sitemaps uploaders [puppet] - 10https://gerrit.wikimedia.org/r/456279 (https://phabricator.wikimedia.org/T202910) (owner: 10Dzahn) [20:56:02] (03PS5) 10Dzahn: admin: add sitemap-admins to webserver_misc_static [puppet] - 10https://gerrit.wikimedia.org/r/455602 (https://phabricator.wikimedia.org/T202910) [20:57:09] (03PS6) 10Dzahn: admin: add sitemap-admins to webserver_misc_static [puppet] - 10https://gerrit.wikimedia.org/r/455602 (https://phabricator.wikimedia.org/T202910) [20:59:15] (03CR) 10Dzahn: [C: 032] "this doesn't include sudo/root so we don't have to put it in a meeting and can just go ahead. in the interest of this being fairly time critica" [puppet] - 10https://gerrit.wikimedia.org/r/455602 (https://phabricator.wikimedia.org/T202910) (owner: 10Dzahn) [21:01:10] (03PS1) 10Andrew Bogott: designate pdns: allow ipv6 access to mysql from other designate servers [puppet] - 10https://gerrit.wikimedia.org/r/456281 [21:02:20] RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational [21:02:49] (03CR) 10Andrew Bogott: [C: 032] designate pdns: allow ipv6 access to mysql from other designate servers [puppet] - 10https://gerrit.wikimedia.org/r/456281 (owner: 10Andrew Bogott) [21:14:43] (03PS1) 10Dzahn: microsites::sitemaps: set sgid bit on sitemaps doc root [puppet] - 10https://gerrit.wikimedia.org/r/456283 (https://phabricator.wikimedia.org/T202910) [21:15:28] (03CR) 10jerkins-bot: [V: 04-1] microsites::sitemaps: set sgid bit on sitemaps doc root [puppet] - 10https://gerrit.wikimedia.org/r/456283 (https://phabricator.wikimedia.org/T202910)
(owner: 10Dzahn) [21:16:50] (03PS2) 10Dzahn: microsites::sitemaps: set sgid bit on sitemaps doc root [puppet] - 10https://gerrit.wikimedia.org/r/456283 (https://phabricator.wikimedia.org/T202910) [21:18:12] (03CR) 10Dzahn: [C: 032] microsites::sitemaps: set sgid bit on sitemaps doc root [puppet] - 10https://gerrit.wikimedia.org/r/456283 (https://phabricator.wikimedia.org/T202910) (owner: 10Dzahn) [21:21:08] (03PS1) 10Bstorm: drbd: There is a chicken/egg problem with timeouts this may fix [puppet] - 10https://gerrit.wikimedia.org/r/456284 (https://phabricator.wikimedia.org/T202323) [21:22:30] (03CR) 10Bstorm: [C: 032] drbd: There is a chicken/egg problem with timeouts this may fix [puppet] - 10https://gerrit.wikimedia.org/r/456284 (https://phabricator.wikimedia.org/T202323) (owner: 10Bstorm) [21:22:37] (03PS2) 10Bstorm: drbd: There is a chicken/egg problem with timeouts this may fix [puppet] - 10https://gerrit.wikimedia.org/r/456284 (https://phabricator.wikimedia.org/T202323) [21:25:31] RECOVERY - Check systemd state on labstore1004 is OK: OK - running: The system is fully operational [21:27:46] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Watching / External): Update Debian package of Blubber (0.5.0-1) - https://phabricator.wikimedia.org/T203121 (10thcipriani) [21:28:50] PROBLEM - Check systemd state on labstore1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[21:32:16] (03PS1) 10Dzahn: microsites::sitemaps: use 'recurse => true' to manage permissions [puppet] - 10https://gerrit.wikimedia.org/r/456285 (https://phabricator.wikimedia.org/T202910) [21:33:10] (03PS1) 10Bstorm: drbd: adding more timeouts so they work together right [puppet] - 10https://gerrit.wikimedia.org/r/456286 (https://phabricator.wikimedia.org/T202323) [21:34:01] (03CR) 10Bstorm: [C: 032] drbd: adding more timeouts so they work together right [puppet] - 10https://gerrit.wikimedia.org/r/456286 (https://phabricator.wikimedia.org/T202323) (owner: 10Bstorm) [21:35:57] (03PS2) 10Dzahn: microsites::sitemaps: use 'recurse => true' to manage permissions [puppet] - 10https://gerrit.wikimedia.org/r/456285 (https://phabricator.wikimedia.org/T202910) [21:36:42] (03PS3) 10Dzahn: microsites::sitemaps: use 'recurse => true' to manage permissions [puppet] - 10https://gerrit.wikimedia.org/r/456285 (https://phabricator.wikimedia.org/T202910) [21:37:40] (03PS4) 10Dzahn: microsites::sitemaps: use 'recurse => true' to manage permissions [puppet] - 10https://gerrit.wikimedia.org/r/456285 (https://phabricator.wikimedia.org/T202910) [21:37:46] (03CR) 10Dzahn: [C: 032] microsites::sitemaps: use 'recurse => true' to manage permissions [puppet] - 10https://gerrit.wikimedia.org/r/456285 (https://phabricator.wikimedia.org/T202910) (owner: 10Dzahn) [21:40:41] (03PS1) 10Herron: puppet_compiler: temporarily proxy two project names with nginx [puppet] - 10https://gerrit.wikimedia.org/r/456287 (https://phabricator.wikimedia.org/T191438) [21:41:23] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10daniel) Marking as ""under discussion" on the RFC board for now. One thing that I believe would move this... 
[21:41:30] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Performance-Team (Radar): add performance team members to webserver_misc_static servers to maintain sitemaps - https://phabricator.wikimedia.org/T202910 (10Dzahn) [21:44:50] (03PS1) 10BryanDavis: striker: Point at cloudcontrol1003 for OpenStack APIs [puppet] - 10https://gerrit.wikimedia.org/r/456288 (https://phabricator.wikimedia.org/T201504) [21:46:31] (03CR) 10BryanDavis: "This is a follow up to Arturo's change in I3fc7b79465213d142e022e485dbd80405ab7f530. That patch will only take effect with a new deploymen" [puppet] - 10https://gerrit.wikimedia.org/r/456288 (https://phabricator.wikimedia.org/T201504) (owner: 10BryanDavis) [21:46:35] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Performance-Team (Radar): add performance team members to webserver_misc_static servers to maintain sitemaps - https://phabricator.wikimedia.org/T202910 (10Dzahn) This should be resolved now: - sitemaps-admins is a group that has all of perf-team... [21:51:30] 10Operations, 10Puppet, 10puppet-compiler, 10Patch-For-Review, 10User-herron: Upgrade Puppet compilers to Stretch - https://phabricator.wikimedia.org/T191438 (10herron) Stretch compilers `compiler1001.puppet-diffs.eqiad.wmflabs` and `compiler1002.puppet-diffs.eqiad.wmflabs` are up and running, and a few... 
[21:51:58] (03PS1) 10Andrew Bogott: keystone: open firewall to a fourth designate host [puppet] - 10https://gerrit.wikimedia.org/r/456290 [21:55:20] (03CR) 10Andrew Bogott: [C: 032] keystone: open firewall to a fourth designate host [puppet] - 10https://gerrit.wikimedia.org/r/456290 (owner: 10Andrew Bogott) [22:05:21] (03PS1) 10Rush: labstore: IPs on secondary interface should match drbd settings [puppet] - 10https://gerrit.wikimedia.org/r/456292 [22:06:32] PROBLEM - High load average on labstore1004 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [24.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [22:09:21] (03PS1) 10Andrew Bogott: Fix ipv6 ptr record for cloudservices1004.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/456293 (https://phabricator.wikimedia.org/T201341) [22:09:47] (03CR) 10Andrew Bogott: [C: 032] Fix ipv6 ptr record for cloudservices1004.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/456293 (https://phabricator.wikimedia.org/T201341) (owner: 10Andrew Bogott) [22:12:05] (03CR) 10Bstorm: "possibly more of a fix than expected :) https://puppet-compiler.wmflabs.org/compiler02/12286/" [puppet] - 10https://gerrit.wikimedia.org/r/456292 (owner: 10Rush) [22:12:21] (03CR) 10Bstorm: [C: 032] labstore: IPs on secondary interface should match drbd settings [puppet] - 10https://gerrit.wikimedia.org/r/456292 (owner: 10Rush) [22:16:30] RECOVERY - Recursive DNS on 208.80.154.24 is OK: DNS OK: 0.093 seconds response time. 
www.wikipedia.org returns 208.80.154.224 [22:17:53] (03CR) 10Krinkle: sitemaps: Generalize varnish rule for sitemaps, to apply to all domains (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/456169 (https://phabricator.wikimedia.org/T198965) (owner: 10Imarlier) [22:19:41] RECOVERY - Check systemd state on labstore1004 is OK: OK - running: The system is fully operational [22:20:26] RECOVERY - drbd service on labstore1004 is OK: OK - drbd is active [22:21:46] RECOVERY - drbd service on labstore1005 is OK: OK - drbd is active [22:28:21] PROBLEM - High load average on labstore1004 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [24.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [22:31:15] apergos: mutante: if you have space for a small config change for noc.wm.o; – https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/456206/ [22:36:19] (03PS1) 10Andrew Bogott: Revert "Fix ipv6 ptr record for cloudservices1004.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/456298 [22:36:33] (03CR) 10Smalyshev: [C: 04-1] "To be merged when HD upgrades are done." 
[puppet] - 10https://gerrit.wikimedia.org/r/456170 (https://phabricator.wikimedia.org/T201217) (owner: 10Smalyshev) [22:36:54] (03CR) 10RobH: [C: 031] Revert "Fix ipv6 ptr record for cloudservices1004.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/456298 (owner: 10Andrew Bogott) [22:37:06] (03CR) 10Andrew Bogott: [C: 032] Revert "Fix ipv6 ptr record for cloudservices1004.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/456298 (owner: 10Andrew Bogott) [22:41:12] RECOVERY - High load average on labstore1004 is OK: OK: Less than 50.00% above the threshold [16.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [22:59:27] (03PS1) 10Bstorm: labstore: Allow override for load monitoring [puppet] - 10https://gerrit.wikimedia.org/r/456300 (https://phabricator.wikimedia.org/T202323) [22:59:50] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [23:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: My dear minions, it's time we take the moon! Just kidding. Time for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180829T2300). [23:00:05] RoanKattouw: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:00:05] (03CR) 10jerkins-bot: [V: 04-1] labstore: Allow override for load monitoring [puppet] - 10https://gerrit.wikimedia.org/r/456300 (https://phabricator.wikimedia.org/T202323) (owner: 10Bstorm) [23:02:11] (03PS2) 10Bstorm: labstore: Allow override for load monitoring [puppet] - 10https://gerrit.wikimedia.org/r/456300 (https://phabricator.wikimedia.org/T202323) [23:05:54] (03CR) 10Andrew Bogott: [C: 031] "I hate that those servers are so busy, but this change looks correct nonetheless" [puppet] - 10https://gerrit.wikimedia.org/r/456300 (https://phabricator.wikimedia.org/T202323) (owner: 10Bstorm) [23:10:04] (03CR) 10Bstorm: [C: 032] labstore: Allow override for load monitoring [puppet] - 10https://gerrit.wikimedia.org/r/456300 (https://phabricator.wikimedia.org/T202323) (owner: 10Bstorm) [23:17:11] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [23:19:44] (03CR) 10Volans: "It seems to me that with --ping this check will never be CRITICAL or WARNING, could be just OK or UNKNOWN. Is that the intended behaviour?" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/456187 (owner: 10Smalyshev) [23:23:26] (03CR) 10Volans: [C: 04-2] "This was the previous behaviour, and was fixed because it was wrong, as didn't allow to install a new maintenance host in the same DC beca" [puppet] - 10https://gerrit.wikimedia.org/r/456175 (https://phabricator.wikimedia.org/T199073) (owner: 10Dzahn) [23:27:47] Hah I almost missed my own SWAT [23:27:49] I guess I'll do it myself [23:55:10] !log catrope@deploy1001 Synchronized php-1.32.0-wmf.19/extensions/Translate/stringmangler/StringMatcher.php: T202058 (duration: 00m 59s) [23:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:37] !log catrope@deploy1001 Synchronized php-1.32.0-wmf.18/extensions/Translate/stringmangler/StringMatcher.php: T202058 (duration: 00m 56s) [23:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log