[00:09:54] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 7.384 second response time
[00:13:14] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:31:15] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.150 second response time
[00:36:54] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:39:04] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 1.996 second response time
[00:39:45] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 3.014 second response time
[00:42:25] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:43:14] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:58:04] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 8.021 second response time
[01:01:24] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:11:55] PROBLEM - dhclient process on notebook1003 is CRITICAL: Return code of 255 is out of bounds
[01:12:04] PROBLEM - DPKG on notebook1003 is CRITICAL: Return code of 255 is out of bounds
[01:12:15] PROBLEM - MD RAID on notebook1003 is CRITICAL: Return code of 255 is out of bounds
[01:12:35] PROBLEM - configured eth on notebook1003 is CRITICAL: Return code of 255 is out of bounds
[01:12:35] PROBLEM - Check systemd state on notebook1003 is CRITICAL: Return code of 255 is out of bounds
[01:12:45] PROBLEM - Disk space on notebook1003 is CRITICAL: Return code of 255 is out of bounds
[01:16:44] PROBLEM - puppet last run on notebook1003 is CRITICAL: Return code of 255 is out of bounds
[01:16:54] RECOVERY - configured eth on notebook1003 is OK: OK - interfaces up
[01:16:54] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational
[01:17:05] RECOVERY - Disk space on notebook1003 is OK: DISK OK
[01:17:24] RECOVERY - dhclient process on notebook1003 is OK: PROCS OK: 0 processes with command name dhclient
[01:17:25] RECOVERY - DPKG on notebook1003 is OK: All packages OK
[01:17:35] RECOVERY - MD RAID on notebook1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[01:21:45] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[01:36:54] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[01:48:18] legoktm, hmm my service is not showing up on dpkg -c
[01:48:54] Krenair: did you update compat and debhelper to 10?
[01:48:58] I do have a debian/certcentral.service file
[01:49:04] compat is 10
[01:49:16] debhelper build-depends version is >= 9
[01:49:33] that should be >= 10
[01:49:36] gonna try making that 10
[01:49:52] yeah
[01:49:56] and are you building this on stretch? (or in a stretch chroot)
[01:50:20] hm still does this the same thing
[01:50:34] I'm building on bionic
[01:51:40] I've got debhelper 11.1.6
[01:52:06] hm, wanna update the patch so I can try building/debugging it?
[01:52:47] (PS9) Alex Monk: Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646
[01:53:48] I do see that the dh-systemd package here says
[01:54:00] Description-en_GB: debhelper add-on to handle systemd unit files - transitional package
[01:54:00] This package is for transitional purposes and can be removed safely.
[01:54:12] (CR) jerkins-bot: [V: -1] Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646 (owner: Alex Monk)
[01:54:27] dpkg -L shows dh-systemd only creates copyright and changelog.gz files
[01:56:46] (PS10) Alex Monk: Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646
[01:57:36] Krenair: oh, I'm stupid
[01:58:08] * legoktm quickly tests
[01:58:44] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[01:58:50] km@km-pt:/srv/operations/software/certcentral$ dpkg -c ../python3-certcentral_0.1_all.deb | grep systemd
[01:58:50] drwxr-xr-x root/root 0 2018-08-29 14:27 ./lib/systemd/
[01:58:50] drwxr-xr-x root/root 0 2018-08-29 14:27 ./lib/systemd/system/
[01:58:50] -rw-r--r-- root/root 232 2018-08-29 14:27 ./lib/systemd/system/python3-certcentral.service
[01:59:23] Krenair: the file is supposed to be named <package>.service, except the name of the binary package isn't certcentral, it's python3-certcentral.
[01:59:52] ah.
[02:00:23] also, since there's only one binary package, you can name it debian/service and it'll auto figure out the name
[02:00:32] legoktm, that did the trick, thanks
[02:02:01] (PS11) Alex Monk: Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646
[02:02:26] Krenair: I'd also recommend looking into git-buildpackage and https://wiki.debian.org/git-pbuilder so that way you're building in an isolated environment
[02:02:48] I'm using git-buildpackage
[02:04:38] https://paste.fedoraproject.org/paste/CjzsJQdT8qVgxRPR4vQIpw is my ~/.gbp.conf
[02:38:07] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.19) (duration: 14m 46s)
[02:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:47:04] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 5.638 second response time
[02:48:17] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Tue Sep 4 02:48:17 UTC 2018 (duration 10m 10s)
[02:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:50:25] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:54:14] legoktm, E: python3-certcentral: init.d-script-not-included-in-package etc/init.d/python3-certcentral
[02:54:18] is an init.d script not optional?
[02:56:01] > The /etc/init.d script is registered in the postinst script, but is not included in the package.
[02:56:27] Krenair: I think you might need https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/debian/+/master/debian/rules#48 ?
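The fix at 01:59 comes down to how debhelper finds the unit file: it is looked up under the binary package's name, and the installed unit keeps that name (the dpkg -c listing shows ./lib/systemd/system/python3-certcentral.service). The Python sketch below is a simplified model of that lookup, not debhelper's code; the function names are hypothetical, and the unprefixed debian/service fallback is modeled as applying only to the first binary package listed in debian/control, which is our reading of debhelper's behavior.

```python
def unit_file_candidates(package, binary_packages):
    """Paths checked, in order, for the package's main systemd unit.

    Simplified model: debian/<package>.service always, plus the
    unprefixed debian/service when <package> is the first binary
    package listed in debian/control (assumed fallback rule).
    """
    candidates = [f"debian/{package}.service"]
    if binary_packages and binary_packages[0] == package:
        candidates.append("debian/service")
    return candidates


def installed_unit_path(package):
    # The installed unit is named after the binary package, as in the
    # dpkg -c listing: ./lib/systemd/system/python3-certcentral.service
    return f"lib/systemd/system/{package}.service"


def resolve_unit(package, binary_packages, files_in_debian):
    """Return (source, destination), or None if no candidate file exists."""
    for path in unit_file_candidates(package, binary_packages):
        if path in files_in_debian:
            return path, installed_unit_path(package)
    return None


if __name__ == "__main__":
    pkgs = ["python3-certcentral"]
    # Named after the source package: not picked up (the original problem).
    print(resolve_unit("python3-certcentral", pkgs, {"debian/certcentral.service"}))
    # Named after the binary package: installed under lib/systemd/system/.
    print(resolve_unit("python3-certcentral", pkgs, {"debian/python3-certcentral.service"}))
```

In this model a file named after the source package (certcentral) never matches, which is the behavior seen before the rename.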
[03:01:21] (PS25) Alex Monk: [WIP] Central certificates service [puppet] - https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962)
[03:01:48] legoktm, thanks
[03:02:11] (CR) jerkins-bot: [V: -1] [WIP] Central certificates service [puppet] - https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: Alex Monk)
[03:03:36] (PS26) Alex Monk: [WIP] Central certificates service [puppet] - https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962)
[03:04:13] "For the benefit of users of 80x25 terminals"
[03:04:15] * Krenair sighs
[03:04:18] 2018.
[03:06:59] :)
[03:07:09] you mean you don't have your 4k monitor zoomed in so much that it only has 80 columns?
[03:08:15] * Krinkle staging on mwdebug1002/deployment
[03:09:30] * Krinkle measures line length of a full line in this channel, Chromium/IRCCloud. only 161.
[03:09:40] I'd expect more than 2x since 80 columns, but alas.
[03:10:24] that's 15", HiDPI, not 4K
[03:12:46] of course when you run dpkg -I it quite happily outputs the 203 character depends line without linebreaks
[03:14:49] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.19/extensions/Kartographer: I351259c46 (duration: 00m 51s)
[03:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:16:01] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.19/extensions/WikimediaEvents/: Ie5c8f5877b (duration: 00m 50s)
[03:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:20:16] (PS12) Alex Monk: Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646
[03:21:39] (CR) jerkins-bot: [V: -1] Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646 (owner: Alex Monk)
[03:23:54] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 5.945 second response time
[03:24:18] (CR) Alex Monk: "recheck" [software/certcentral] - https://gerrit.wikimedia.org/r/456646 (owner: Alex Monk)
[03:25:39] (CR) jerkins-bot: [V: -1] Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646 (owner: Alex Monk)
[03:27:15] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:27:31] (CR) Alex Monk: "recheck" [software/certcentral] - https://gerrit.wikimedia.org/r/456646 (owner: Alex Monk)
[03:29:37] Operations, Traffic: certcentral: phantom test failure around challenge success - https://phabricator.wikimedia.org/T203422 (Krenair) p:Triage>Normal
[03:39:14] (PS13) Alex Monk: Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646
[03:39:53] (PS14) Alex Monk: Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646
[03:41:10] Operations, Traffic: certcentral: Provide script for certificate revocation - https://phabricator.wikimedia.org/T203423 (Krenair) p:Triage>Normal
[03:41:37] (CR) jerkins-bot: [V: -1] Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646 (owner: Alex Monk)
[03:42:58] (PS15) Alex Monk: Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646
[03:44:22] (CR) jerkins-bot: [V: -1] Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646 (owner: Alex Monk)
[03:44:30] (CR) Alex Monk: [C: 1] "good, though I'm providing a config.example.yaml in the package which you could put this into?" [software/certcentral] - https://gerrit.wikimedia.org/r/457485 (https://phabricator.wikimedia.org/T199711) (owner: Vgutierrez)
[03:48:12] (CR) Alex Monk: "recheck" [software/certcentral] - https://gerrit.wikimedia.org/r/456646 (owner: Alex Monk)
[03:49:35] (CR) jerkins-bot: [V: -1] Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646 (owner: Alex Monk)
[03:50:56] (PS16) Alex Monk: Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646
[03:52:16] (CR) jerkins-bot: [V: -1] Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646 (owner: Alex Monk)
[03:53:30] (PS17) Alex Monk: Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646
[04:01:22] legoktm, I think that leaves https://gerrit.wikimedia.org/r/#/c/operations/software/certcentral/+/456646/7/debian/source/format which we should talk about later
[04:23:25] Krenair: I think for now you should probably have it as 3.0 (native) and then switch it to non-native later on
[04:29:35] PROBLEM - High load average on labstore1004 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [70.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[04:40:25] RECOVERY - High load average on labstore1004 is OK: OK: Less than 50.00% above the threshold [50.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[05:03:24] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - https://gerrit.wikimedia.org/r/457810
[05:06:17] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - https://gerrit.wikimedia.org/r/457810 (owner: Marostegui)
[05:07:35] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - https://gerrit.wikimedia.org/r/457810 (owner: Marostegui)
[05:08:34] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1086 (duration: 00m 49s)
[05:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:10:11] (PS1) Marostegui: db-eqiad.php: Depool db1079 [mediawiki-config] - https://gerrit.wikimedia.org/r/457811
[05:10:24] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 8.810 second response time
[05:12:02] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1079 [mediawiki-config] - https://gerrit.wikimedia.org/r/457811 (owner: Marostegui)
[05:13:22] (Merged) jenkins-bot: db-eqiad.php: Depool db1079 [mediawiki-config] - https://gerrit.wikimedia.org/r/457811 (owner: Marostegui)
[05:13:30] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - https://gerrit.wikimedia.org/r/457810 (owner: Marostegui)
[05:13:34] (CR) jenkins-bot: db-eqiad.php: Depool db1079 [mediawiki-config] - https://gerrit.wikimedia.org/r/457811 (owner: Marostegui)
[05:13:44] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:14:25] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1079 (duration: 00m 50s)
[05:14:28] !log Deploy schema change on db1079 (this will generate lag on labsdb:s7)
[05:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:15:55] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 8.252 second response time
[05:19:14] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:26:54] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 8.635 second response time
[05:30:14] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:05:05] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 9.291 second response time
[06:05:09] (CR) Giuseppe Lavagetto: "I'm pretty neutral about this - I'm not even sure what memcached would log with -v;" [puppet] - https://gerrit.wikimedia.org/r/456096 (owner: Elukey)
[06:08:24] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:08:42] (PS3) Muehlenhoff: Extend Imagemagick policy file to disable Postscript/PDF [puppet] - https://gerrit.wikimedia.org/r/454544
[06:16:32] (CR) Muehlenhoff: [C: 2] Extend Imagemagick policy file to disable Postscript/PDF [puppet] - https://gerrit.wikimedia.org/r/454544 (owner: Muehlenhoff)
[06:20:06] !log upgrading labweb* to wikidiff 1.7.3
[06:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:21] Operations, ops-codfw: mw2213 correctable memory errors - https://phabricator.wikimedia.org/T194172 (elukey) >>! In T194172#4553936, @MoritzMuehlenhoff wrote: > @Joe , @elukey : Any objections? Otherwise I'll turn this into a decom ticket. We'd go from 6 to 5 api servers in ROW 5, I think that there is...
[06:32:23] !log upgrading deploy* to wikidiff 1.7.3
[06:32:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:02] !log rebooting deploy2001 for kernel security update
[06:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:43] Operations, MediaWiki-Cache: Improve memcache logs on mc* hosts - https://phabricator.wikimedia.org/T203429 (elukey) p:Triage>Normal
[06:58:53] (PS6) Elukey: memcached: enable basic logging with the -v parameter [puppet] - https://gerrit.wikimedia.org/r/456096 (https://phabricator.wikimedia.org/T203429)
[07:00:56] !log upgrading remaining appservers/API servers to wikidiff 1.7.3
[07:00:57] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - https://gerrit.wikimedia.org/r/457822
[07:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:19] Operations, MediaWiki-Cache, Patch-For-Review, User-Elukey: Improve memcache logs on mc* hosts - https://phabricator.wikimedia.org/T203429 (elukey)
[07:02:36] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - https://gerrit.wikimedia.org/r/457822 (owner: Marostegui)
[07:03:59] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - https://gerrit.wikimedia.org/r/457822 (owner: Marostegui)
[07:04:12] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - https://gerrit.wikimedia.org/r/457822 (owner: Marostegui)
[07:05:05] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1079 (duration: 00m 54s)
[07:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:19] (PS1) Marostegui: db-eqiad.php: Depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/457825
[07:07:55] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/457825 (owner: Marostegui)
[07:09:19] (Merged) jenkins-bot: db-eqiad.php: Depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/457825 (owner: Marostegui)
[07:10:35] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1094 (duration: 00m 52s)
[07:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:17] (CR) jenkins-bot: db-eqiad.php: Depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/457825 (owner: Marostegui)
[07:32:54] PROBLEM - Keyholder SSH agent on deploy2001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it.
[07:43:43] ^ fixing
[07:44:46] !log rearmed keyholder on deploy2001
[07:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:04] RECOVERY - Keyholder SSH agent on deploy2001 is OK: OK: Keyholder is armed with all configured keys.
[07:52:00] !log rolling restart of elasticsearch / cirrus / codfw for various updates and data directory migration completed - T198351
[07:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:05] T198351: Refactor puppet to support multiple elasticsearch instances on same node - https://phabricator.wikimedia.org/T198351
[08:01:24] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[08:02:01] (PS1) Aleksey Bekh-Ivanov (WMDE): Wikidata: Use new item ID formatter for Q1-Q100000 [mediawiki-config] - https://gerrit.wikimedia.org/r/457831 (https://phabricator.wikimedia.org/T201835)
[08:02:13] (CR) jerkins-bot: [V: -1] Wikidata: Use new item ID formatter for Q1-Q100000 [mediawiki-config] - https://gerrit.wikimedia.org/r/457831 (https://phabricator.wikimedia.org/T201835) (owner: Aleksey Bekh-Ivanov (WMDE))
[08:02:29] (PS2) Aleksey Bekh-Ivanov (WMDE): Wikidata: Use new item ID formatter for Q1-Q100000 [mediawiki-config] - https://gerrit.wikimedia.org/r/457831 (https://phabricator.wikimedia.org/T201835)
[08:03:34] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[08:04:30] (CR) Aleksey Bekh-Ivanov (WMDE): "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/457831 (https://phabricator.wikimedia.org/T201835) (owner: Aleksey Bekh-Ivanov (WMDE))
[08:04:37] !log upgrading mwmaint* to wikidiff 1.7.3
[08:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:24] (PS5) Volans: sre.switchdc.mediawiki: add Phase 4 cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/456588 (https://phabricator.wikimedia.org/T199079)
[08:06:28] (CR) Volans: sre.switchdc.mediawiki: add Phase 4 cookbooks (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/456588 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[08:06:40] Operations, TCB-Team, WMDE-QWERTY-Team, wikidiff2, WMDE-QWERTY-Sprint-2018-08-29: Release and deploy wikidiff2 v1.7.3 - https://phabricator.wikimedia.org/T202301 (MoritzMuehlenhoff) Open>Resolved 1.7.3 has been rolled out to the app servers (some in codfw still need the update, this w...
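The graphite1001 alerts report two numbers: the share of recent datapoints over a threshold, and the threshold itself (critical 5000.0 in the CRITICAL line, 1000.0 in the OK line). A minimal Python model of such a percentage-over-threshold check is sketched below. It assumes the check goes CRITICAL when at least the configured percentage of non-null points exceeds the critical threshold, which is consistent with the 40.00% alert above but is not taken from the check's actual source; the evaluate function and its parameters are hypothetical.

```python
def evaluate(datapoints, warning, critical, percentage):
    """Simplified percentage-over-threshold check (hypothetical model).

    CRITICAL if at least `percentage`% of the non-null datapoints exceed
    `critical`; WARNING likewise for `warning`; otherwise OK.
    """
    points = [p for p in datapoints if p is not None]  # skip gaps in the series
    if not points:
        return "UNKNOWN", "no datapoints"

    def pct_over(threshold):
        return 100.0 * sum(1 for p in points if p > threshold) / len(points)

    crit = pct_over(critical)
    if crit >= percentage:
        return "CRITICAL", f"{crit:.2f}% of data above the critical threshold [{critical}]"
    warn = pct_over(warning)
    if warn >= percentage:
        return "WARNING", f"{warn:.2f}% of data above the warning threshold [{warning}]"
    return "OK", f"Less than {percentage:.2f}% above the threshold [{warning}]"


if __name__ == "__main__":
    # Two of five points over 5000.0 -> 40% -> CRITICAL with a 40% setting.
    print(evaluate([6000.0, 7000.0, 100.0, 200.0, 300.0], 1000.0, 5000.0, 40.0))
```

Because the decision depends on a share of the window rather than a single sample, a short spike (like the two-minute memcached blip) can trip and clear the alert quickly.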
[08:07:41] (PS2) Jcrespo: mariadb-backups: Provide backup file metadata information [puppet] - https://gerrit.wikimedia.org/r/456608 (https://phabricator.wikimedia.org/T198987)
[08:09:18] the memcached errors are the "usual" mc1035 problem
[08:09:27] (CR) Jcrespo: [C: 2] mariadb-backups: Provide backup file metadata information [puppet] - https://gerrit.wikimedia.org/r/456608 (https://phabricator.wikimedia.org/T198987) (owner: Jcrespo)
[08:09:47] (PS2) Jcrespo: mariadb-backups: Calculate total backup size [puppet] - https://gerrit.wikimedia.org/r/456613 (https://phabricator.wikimedia.org/T198987)
[08:10:39] (CR) Jcrespo: [C: 2] mariadb-backups: Calculate total backup size [puppet] - https://gerrit.wikimedia.org/r/456613 (https://phabricator.wikimedia.org/T198987) (owner: Jcrespo)
[08:19:43] (PS3) Jcrespo: mysql-prometheus-exporter: Fix deleted x1 instance from dbstore2001 [puppet] - https://gerrit.wikimedia.org/r/457499
[08:20:00] (CR) Vgutierrez: [C: -1] Packaging stuff and readme (5 comments) [software/certcentral] - https://gerrit.wikimedia.org/r/456646 (owner: Alex Monk)
[08:20:30] Operations, ops-codfw, decommission: Decom mw2213 - https://phabricator.wikimedia.org/T203434 (MoritzMuehlenhoff)
[08:24:46] (CR) Muehlenhoff: Packaging stuff and readme (1 comment) [software/certcentral] - https://gerrit.wikimedia.org/r/456646 (owner: Alex Monk)
[08:28:40] !log starting rolling restart of elasticsearch / cirrus / eqiad for various updates and data directory migration - T198351
[08:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:48] T198351: Refactor puppet to support multiple elasticsearch instances on same node - https://phabricator.wikimedia.org/T198351
[08:29:15] (PS2) Gehel: elasticsearch: move elasticsearch data directory [puppet] - https://gerrit.wikimedia.org/r/456138 (https://phabricator.wikimedia.org/T198351)
[08:29:41] (PS1) Volans: spicerack: add redis sessions configuration [puppet] - https://gerrit.wikimedia.org/r/457836 (https://phabricator.wikimedia.org/T199079)
[08:30:42] (CR) jerkins-bot: [V: -1] spicerack: add redis sessions configuration [puppet] - https://gerrit.wikimedia.org/r/457836 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[08:30:44] (CR) Gehel: [C: 2] elasticsearch: move elasticsearch data directory [puppet] - https://gerrit.wikimedia.org/r/456138 (https://phabricator.wikimedia.org/T198351) (owner: Gehel)
[08:32:18] (PS2) Volans: spicerack: add redis sessions configuration [puppet] - https://gerrit.wikimedia.org/r/457836 (https://phabricator.wikimedia.org/T199079)
[08:39:03] (CR) Volans: "Puppet compiler available here: https://puppet-compiler.wmflabs.org/compiler02/12341/sarin.codfw.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/457836 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[08:42:05] (PS1) Jcrespo: mariadb: Depool db1114 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/457839
[08:43:38] (PS4) Jcrespo: mysql-prometheus-exporter: Fix deleted x1 instance from dbstore2001 [puppet] - https://gerrit.wikimedia.org/r/457499
[08:44:16] (CR) Jcrespo: [C: 2] mysql-prometheus-exporter: Fix deleted x1 instance from dbstore2001 [puppet] - https://gerrit.wikimedia.org/r/457499 (owner: Jcrespo)
[08:45:46] (CR) Jcrespo: [C: 2] mariadb: Depool db1114 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/457839 (owner: Jcrespo)
[08:45:50] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - https://gerrit.wikimedia.org/r/457841
[08:46:52] (PS1) Elukey: role::analytics_cluster::coordinator: tweak profile::base's contacts [puppet] - https://gerrit.wikimedia.org/r/457842 (https://phabricator.wikimedia.org/T172532)
[08:47:18] (Merged) jenkins-bot: mariadb: Depool db1114 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/457839 (owner: Jcrespo)
[08:47:56] (CR) Elukey: [C: 2] role::analytics_cluster::coordinator: tweak profile::base's contacts [puppet] - https://gerrit.wikimedia.org/r/457842 (https://phabricator.wikimedia.org/T172532) (owner: Elukey)
[08:48:21] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - https://gerrit.wikimedia.org/r/457841
[08:50:39] (CR) Marostegui: "Again that host? :-(" [mediawiki-config] - https://gerrit.wikimedia.org/r/457839 (owner: Jcrespo)
[08:51:48] jynus: you deploying?
[08:53:05] just merge both at the same time
[08:53:17] should I merge mine then?
[08:53:21] yes
[08:53:24] ok
[08:53:28] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - https://gerrit.wikimedia.org/r/457841 (owner: Marostegui)
[08:53:35] I was preparing codfw for dc switch
[08:53:51] (CR) jenkins-bot: mariadb: Depool db1114 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/457839 (owner: Jcrespo)
[08:53:53] Do you want me to deploy once my change is merged?
[08:54:00] (CR) jerkins-bot: [V: -1] Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - https://gerrit.wikimedia.org/r/457841 (owner: Marostegui)
[08:54:08] -1? come on jenkins
[08:54:31] (CR) Marostegui: [C: 2] "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/457841 (owner: Marostegui)
[08:54:53] * elukey supports -1s to marostegui
[08:55:03] xdddd
[08:56:09] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - https://gerrit.wikimedia.org/r/457841 (owner: Marostegui)
[08:56:15] unhappy jenkins is unhappy
[08:56:17] elukey: ^
[08:56:34] PROBLEM - Check systemd state on analytics1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:56:46] this is me testing --^
[08:57:06] (PS1) Jcrespo: mariadb: Fix DB configuration in preparation for dc switchover [mediawiki-config] - https://gerrit.wikimedia.org/r/457847 (https://phabricator.wikimedia.org/T189107)
[08:57:34] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1094, depool db1114 (duration: 00m 50s)
[08:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:06] jynus: change merged and deployed
[08:58:17] thanks
[08:58:45] RECOVERY - Check systemd state on analytics1003 is OK: OK - running: The system is fully operational
[09:01:17] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - pdfrender_5252: Servers scb1002.eqiad.wmnet, scb1004.eqiad.wmnet are marked down but pooled
[09:01:31] (PS2) Muehlenhoff: Remove now obsolete Hiera setting profile::base::enable_microcode [puppet] - https://gerrit.wikimedia.org/r/457532
[09:02:27] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - pdfrender_5252: Servers scb1002.eqiad.wmnet are marked down but pooled
[09:02:59] uh
[09:05:07] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:05:51] PROBLEM - LVS HTTP IPv4 on pdfrender.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:06:10] there we go
[09:06:10] wah wah
[09:06:19] <_joe_> it's 2 days it's suffering
[09:06:24] * volans restarting on scb1004
[09:06:27] <_joe_> I assumed someone was looking into it
[09:06:49] !log restarted pdfrender on scb1004
[09:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:55] <_joe_> or someone else can write the cookbook for switching services over, ofc
[09:06:56] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:07:07] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.004 second response time
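The 09:01 PyBal alerts say failing scb hosts were "marked down but pooled". Our understanding is that PyBal has a depool threshold that keeps it from depooling so many backends that the pool becomes too small, so a failing server can stay in rotation. The Python sketch below models only that idea under stated assumptions: the function name, the 0.5 ratio and the check order are hypothetical, and this is not PyBal's implementation.

```python
def apply_health_checks(servers, healthy, depool_threshold=0.5):
    """Decide which failing servers get depooled (hypothetical model).

    A failing server is depooled only while the pool would still hold at
    least depool_threshold * len(servers) members; failing servers beyond
    that stay pooled and would be reported as "marked down but pooled".
    """
    pooled = list(servers)
    min_pooled = depool_threshold * len(servers)
    down_but_pooled = []
    for name in servers:
        if name in healthy:
            continue
        if len(pooled) - 1 >= min_pooled:
            pooled.remove(name)  # safe to take out of rotation
        else:
            down_but_pooled.append(name)  # threshold reached: keep serving
    return pooled, down_but_pooled


if __name__ == "__main__":
    scb = ["scb1001", "scb1002", "scb1003", "scb1004"]
    # All four failing: two are depooled, two stay pooled despite failing.
    print(apply_health_checks(scb, healthy=set()))
```

With all four backends failing and a 0.5 ratio, two failing hosts remain pooled, which has the same shape as the lvs1016 alert; which hosts end up stuck depends on check order.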
[09:07:41] !log restarted pdfrender on scb1002
[09:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:52] <_joe_> volans: ouch, I was attached with strace
[09:07:57] <_joe_> on 1002 :P
[09:07:58] _joe_: sorry
[09:08:03] <_joe_> it was you then
[09:08:08] <_joe_> I thought I segfaulted the thing
[09:08:11] lol
[09:08:18] why LVS is not recovering?
[09:08:27] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy
[09:08:37] <_joe_> it is
[09:08:37] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy
[09:08:43] volans: see you just needed to politely ask
[09:08:50] RECOVERY - LVS HTTP IPv4 on pdfrender.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.004 second response time
[09:08:51] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[09:09:04] 1001 failing too, _joe_ do you want to keep that for debugging?
[09:09:09] <_joe_> scb1003.eqiad.wmnet and scb1001.eqiad.wmnet still down
[09:09:12] <_joe_> volans: yes please
[09:09:27] <_joe_> volans: wasn't 1004 depooled?
[09:09:44] no, it was repooled after alex told me to, that proton will replace all of this
[09:09:53] <_joe_> sure, sure
[09:09:54] and just restart it
[09:09:59] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - https://gerrit.wikimedia.org/r/457841 (owner: Marostegui)
[09:10:03] <_joe_> proton is alerting since 1 week or so
[09:10:06] this is becoming a nuisance though
[09:10:26] I thought it was on of those once off incidents
[09:10:31] it's already 3 times in 4 days
[09:10:40] it mostly a continuum
[09:10:41] than 3 times
[09:10:53] !log restarted pdfrender on scb1003
[09:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:02] _joe_: I'll leave 1001 untouched, depool it at will
[09:11:15] would it be an option to depool eqiad ?
[09:11:17] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.005 second response time [09:11:35] if it's load related codfw is better equipped to handle the load [09:11:49] otherwise we are looking into debugging a barely maintained service [09:12:06] <_joe_> akosiaris: go ahead and depool eqiad [09:12:56] !log depool pdfrender in eqiad, hopefully codfw will be better equipped to handle the load [09:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:55] it's so badly maintained we don't even have grafana graphs for it [09:15:12] https://grafana.wikimedia.org/dashboard/db/mediawiki-electronpdfservice?orgId=1 seems to be mediawiki extensions specific [09:15:28] with some ocg sprinkled on top of it [09:15:40] em sorry ocg v2 I mean (the actual ocg) [09:15:49] <_joe_> ok so [09:16:55] <_joe_> the main process on scb1001 is blocked in an epoll_wait on an anon_inode [09:17:02] _joe_: fully unrelated but saying so I don't forget it. I 've exposed the host CPU to proton1001 in case that would help proton fix those bad times at https://grafana.wikimedia.org/dashboard/db/service-endpoint-performance?panelId=2&fullscreen&orgId=1 [09:17:05] it did not help [09:17:26] <_joe_> which AFAIR means it's waiting for incoming requests [09:18:00] (03PS1) 10Ema: cache_canary: add codfw backends for api/appservers [puppet] - 10https://gerrit.wikimedia.org/r/457849 (https://phabricator.wikimedia.org/T199079) [09:18:02] (03PS1) 10Ema: cache_canary: switch mediawiki to codfw [puppet] - 10https://gerrit.wikimedia.org/r/457850 (https://phabricator.wikimedia.org/T199079) [09:19:05] (03PS1) 10Marostegui: db-eqiad.php: Depool db1090:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457851 [09:20:07] still however... 
scb1001 hasn't recovered [09:20:50] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1090:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/457851 (owner: Marostegui) [09:21:27] <_joe_> akosiaris: yeah I'm looking at what happens when I make a request [09:21:35] <_joe_> it's pretty strange [09:21:46] electron strange ? [09:21:52] that would be a first [09:22:04] (Merged) jenkins-bot: db-eqiad.php: Depool db1090:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/457851 (owner: Marostegui) [09:23:15] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1090:3317 (duration: 00m 49s) [09:23:18] !log Deploy schema change on db1090:3317 [09:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:24] <_joe_> also TIL it has buffered output [09:23:32] <_joe_> so the access logs are useless [09:23:38] <_joe_> they have the timestamps all wrong [09:24:06] !log stop, upgrade and running analyze on db1114 [09:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:40] (CR) jenkins-bot: db-eqiad.php: Depool db1090:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/457851 (owner: Marostegui) [09:28:03] for some reason, some hosts provide bad query plans, I am researching on db1114 so we can apply it to codfw hosts if necessary [09:28:24] (wrong channel) [09:29:30] Operations, Goal, Patch-For-Review: Perform a datacenter switchover (2018-19 Q1) - https://phabricator.wikimedia.org/T199073 (akosiaris) [09:32:23] !log restart ircecho as attempt to see if icinga-vm re-joins the analytics chan [09:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:36] (CR) Ema: [C: 2] cache_canary: add codfw backends for api/appservers [puppet] - https://gerrit.wikimedia.org/r/457849 (https://phabricator.wikimedia.org/T199079) (owner: 
Ema) [09:36:55] Operations, monitoring, Patch-For-Review, Upstream: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551 (jcrespo) Open>Resolved [09:40:36] !log deploying latest event schedulers to all core db hosts [09:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:20] (CR) Marostegui: [C: -1] "Mostly nitpick, the -1 is for for line 709 which needs to be commented out" (4 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/457847 (https://phabricator.wikimedia.org/T189107) (owner: Jcrespo) [09:45:09] (PS1) Banyek: Labs: Add support for socket path and/or port (multiinstance support) to redact_sanitarium.sh [puppet] - https://gerrit.wikimedia.org/r/457857 (https://phabricator.wikimedia.org/T203394) [09:45:19] (PS1) Volans: sre.switchdc.mediawiki: make space for new Phase 5 [cookbooks] - https://gerrit.wikimedia.org/r/457858 (https://phabricator.wikimedia.org/T199079) [09:45:21] (PS1) Volans: sre.switchdc.mediawiki: add Phase 5 cookbook [cookbooks] - https://gerrit.wikimedia.org/r/457859 (https://phabricator.wikimedia.org/T199079) [09:45:33] (CR) Jcrespo: "comments" (4 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/457847 (https://phabricator.wikimedia.org/T189107) (owner: Jcrespo) [09:45:51] (CR) jerkins-bot: [V: -1] Labs: Add support for socket path and/or port (multiinstance support) to redact_sanitarium.sh [puppet] - https://gerrit.wikimedia.org/r/457857 (https://phabricator.wikimedia.org/T203394) (owner: Banyek) [09:45:53] (CR) Volans: "This depends on Ibc4313fd6b873046b8b7eb78857d18b3f760c286" [cookbooks] - https://gerrit.wikimedia.org/r/457859 (https://phabricator.wikimedia.org/T199079) (owner: Volans) [09:48:20] (PS2) Banyek: Labs: Add support for socket path and/or port to redact_sanitarium.sh [puppet] - https://gerrit.wikimedia.org/r/457857 (https://phabricator.wikimedia.org/T203394)
[09:48:42] (CR) Marostegui: "Removing -1, only pending then to decide what to do with line 330 for now (line 182 can wait a bit)" (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/457847 (https://phabricator.wikimedia.org/T189107) (owner: Jcrespo) [09:53:40] (CR) Gehel: [C: 1] "LGTM, trivial enough" [cookbooks] - https://gerrit.wikimedia.org/r/457858 (https://phabricator.wikimedia.org/T199079) (owner: Volans) [09:54:18] (PS3) Banyek: Labs: Add support for socket path and/or port to redact_sanitarium.sh [puppet] - https://gerrit.wikimedia.org/r/457857 (https://phabricator.wikimedia.org/T203394) [09:54:37] (PS1) Alexandros Kosiaris: Do not default the hiera() deployment_server call [puppet] - https://gerrit.wikimedia.org/r/457862 [09:56:20] Operations, TCB-Team, WMDE-QWERTY-Team, wikidiff2, WMDE-QWERTY-Sprint-2018-08-29: Release and deploy wikidiff2 v1.7.3 - https://phabricator.wikimedia.org/T202301 (WMDE-Fisch) [09:56:57] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.008 second response time [09:57:03] (CR) Alexandros Kosiaris: [C: 1] mediawiki: improve stop_cronjobs() method [software/spicerack] - https://gerrit.wikimedia.org/r/457367 (https://phabricator.wikimedia.org/T199079) (owner: Volans) [09:57:24] <_joe_> !log restarted pdfrender on scb1001 [09:57:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:56] (CR) Alexandros Kosiaris: [C: 1] Add licence and copyright note [cookbooks] - https://gerrit.wikimedia.org/r/457521 (https://phabricator.wikimedia.org/T199079) (owner: Volans) [09:58:40] (CR) Alexandros Kosiaris: [C: 1] sre.switchdc.mediawiki: add Phase 5 cookbook [cookbooks] - https://gerrit.wikimedia.org/r/457859 (https://phabricator.wikimedia.org/T199079) (owner: Volans) [09:58:45] elukey: I think it won’t join the analytics channel because the channel is set to +r [09:58:57] (CR) Alexandros 
Kosiaris: [C: 1] sre.switchdc.mediawiki: make space for new Phase 5 [cookbooks] - https://gerrit.wikimedia.org/r/457858 (https://phabricator.wikimedia.org/T199079) (owner: Volans) [10:02:38] (PS4) Banyek: Labs: Add support for socket path and/or port to redact_sanitarium.sh [puppet] - https://gerrit.wikimedia.org/r/457857 (https://phabricator.wikimedia.org/T203394) [10:05:03] paladox: hi! We didn't change anything recently, so it is due to new freenode policies [10:05:06] ? [10:05:43] elukey: could it be the channel is inheriting the modes from another channel ? [10:05:44] also I don't see the +r flag [10:06:35] Yeh if it’s inheriting from another channel you won’t see it [10:08:53] elukey when joining that channel as unregistered i get "[11:08] == #wikimedia-analytics Cannot join channel (+b) - you are banned" [10:09:02] (which means it's +r) [10:09:21] (CR) Ema: [C: 1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/456588 (https://phabricator.wikimedia.org/T199079) (owner: Volans) [10:09:45] (CR) Alexandros Kosiaris: [C: 2] Do not default the hiera() deployment_server call [puppet] - https://gerrit.wikimedia.org/r/457862 (owner: Alexandros Kosiaris) [10:09:57] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [10:10:16] (CR) Alexandros Kosiaris: [C: 2] Parameterize tmpfs size [puppet/nginx] - https://gerrit.wikimedia.org/r/455830 (https://phabricator.wikimedia.org/T200722) (owner: Alexandros Kosiaris) [10:10:51] paladox: ah thanks!
[10:11:36] so the icinga-vm is not able to join [10:11:55] your welcome :) [10:11:58] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [10:12:30] elukey there's a patch by Krenair to support authing [10:12:31] https://gerrit.wikimedia.org/r/c/operations/puppet/+/455277 [10:13:14] (PS1) Alexandros Kosiaris: Update nginx submodule for tmpfs size change [puppet] - https://gerrit.wikimedia.org/r/457865 [10:13:17] so if I manage to add a ban exempt as described in the task, maybe I can make it work now [10:17:32] paladox: worked, thanks a lot :) [10:17:39] your welcome :) [10:17:52] !log restart ircecho on einstenium to force it re-join #wikimedia-analytics [10:17:54] (CR) Alexandros Kosiaris: [C: 2] Update nginx submodule for tmpfs size change [puppet] - https://gerrit.wikimedia.org/r/457865 (owner: Alexandros Kosiaris) [10:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:04] (Abandoned) Elukey: Move the nginx module back from environments/production [puppet] - https://gerrit.wikimedia.org/r/442244 (owner: Elukey) [10:23:06] (Abandoned) Elukey: Move the nginx submodule to environments/production [puppet] - https://gerrit.wikimedia.org/r/442242 (owner: Elukey) [10:23:49] PROBLEM - Check systemd state on analytics1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:23:50] !log Deploy schema change on s3 codfw masters (this will generate lag on s3 codfw) [10:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:59] an1003 is me [10:24:33] (CR) Alexandros Kosiaris: [C: 1] "Wrongly commented in this task, +1" [puppet] - https://gerrit.wikimedia.org/r/457492 (owner: Giuseppe Lavagetto) [10:25:59] RECOVERY - Check systemd state on analytics1003 is OK: OK - running: The system is fully operational [10:29:58] !log Deploy schema change on dbstore1002:s3 [10:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:44] (PS1) Alexandros Kosiaris: Test switchover of the deployment server [puppet] - https://gerrit.wikimedia.org/r/457867 [10:43:32] (PS1) Jcrespo: Revert "mariadb: Depool db1114 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/457868 [10:47:24] (CR) Muehlenhoff: [C: 1] Test switchover of the deployment server [puppet] - https://gerrit.wikimedia.org/r/457867 (owner: Alexandros Kosiaris) [10:47:48] Operations, DBA, Epic, Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (jcrespo) Missing partitions on codfw: ``` db2085:3311:enwiki:logging db2088:3311:enwiki:logging db2088:3312:bgwiktionary:revision db2088:3312:bgwiktionary:logging db2088... [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the European Mid-day SWAT(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180904T1100). [11:00:04] Jonas_WMDE, odder, Urbanecm, d3r1ck, and Aleksey_WMDE: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:01:38] o/ [11:02:42] (CR) Hashar: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/457464 (https://phabricator.wikimedia.org/T203392) (owner: Urbanecm) [11:03:12] hashar: you'll do the swat? [11:03:25] (I'm around) [11:03:47] zeljkof: yup [11:04:12] (Merged) jenkins-bot: Two throttle rules for SMEX editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/457464 (https://phabricator.wikimedia.org/T203392) (owner: Urbanecm) [11:05:03] !log hashar@deploy1001 sync-file aborted: (no justification provided) (duration: 00m 01s) [11:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:02] !log hashar@deploy1001 Synchronized wmf-config/throttle.php: Two throttle rules for SMEX editathon - T203392 (duration: 00m 51s) [11:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:08] T203392: Mass account creation at editaton - https://phabricator.wikimedia.org/T203392 [11:07:08] (CR) Hashar: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/457455 (https://phabricator.wikimedia.org/T203343) (owner: Odder) [11:08:22] (Merged) jenkins-bot: Update logos for the Russian Wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/457455 (https://phabricator.wikimedia.org/T203343) (owner: Odder) [11:08:26] o/ [11:10:14] !log hashar@deploy1001 Synchronized static/images/project-logos: Update logos for the Russian Wikisource - T203343 (duration: 00m 49s) [11:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:18] Operations, DBA, Epic, Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (Marostegui) I have checked bgwitionary, eowiki, idwiki and frwiktionary and they do not exist on eqiad either.
[11:10:19] T203343: Create HiDPI logos for Russian Wikisource - https://phabricator.wikimedia.org/T203343 [11:11:50] !log stopping replication and running partitioning on logging on db1085:3311 T189107 [11:11:50] (CR) Marostegui: [C: 1] "I have checked this with fixcopyrightwiki and it worked fine" [puppet] - https://gerrit.wikimedia.org/r/457857 (https://phabricator.wikimedia.org/T203394) (owner: Banyek) [11:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:55] T189107: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 [11:12:45] d3r1ck: your patch is being merged [11:13:00] Yup, testing it now! [11:13:33] d3r1ck: I havent deployed it on mwdebug1001 yet :D [11:13:43] Okay, waiting for green light :) [11:14:20] d3r1ck: it is on mwdebug1001 now :] [11:14:30] Okay, testing... [11:14:33] I cant remember how things work, but there might be some delay before the javascript gets updated [11:15:22] Jonas_WMDE: Guten Tag. Are you around to swat deploy https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/456124/ "Enable WBQualityConstraintsSuggestionsBetaFeature on beta" ? :] [11:15:33] OH [11:15:34] that is for beta [11:15:52] (CR) Hashar: [C: 2] "That is for beta and is thus harmless." [mediawiki-config] - https://gerrit.wikimedia.org/r/456124 (https://phabricator.wikimedia.org/T202712) (owner: Jonas Kress (WMDE)) [11:16:13] Jonas_WMDE: deployed :] [11:17:06] thanks hashar ! [11:17:12] (Merged) jenkins-bot: Enable WBQualityConstraintsSuggestionsBetaFeature on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/456124 (https://phabricator.wikimedia.org/T202712) (owner: Jonas Kress (WMDE)) [11:17:43] hashar: Thanks, I've tested and it works great!
[11:18:15] d3r1ck: good deploying :) [11:18:24] Jonas_WMDE: also Aleksey_WMDE sent a patch to bump wmgWikibaseMaxItemIdForNewItemIdHtmlFormatter https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/457831/2/wmf-config/InitialiseSettings.php but is not in there apparently :] [11:18:36] (CR) jenkins-bot: Two throttle rules for SMEX editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/457464 (https://phabricator.wikimedia.org/T203392) (owner: Urbanecm) [11:18:37] hashar: Thanks :) [11:18:38] (CR) jenkins-bot: Update logos for the Russian Wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/457455 (https://phabricator.wikimedia.org/T203343) (owner: Odder) [11:18:40] (CR) jenkins-bot: Enable WBQualityConstraintsSuggestionsBetaFeature on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/456124 (https://phabricator.wikimedia.org/T202712) (owner: Jonas Kress (WMDE)) [11:19:01] hashar: Krinkle prepared me for this yesterday night :) [11:19:21] d3r1ck: yeah Timo is quite awesome :] [11:19:36] :) [11:19:52] !log hashar@deploy1001 Synchronized php-1.32.0-wmf.19/extensions/DismissableSiteNotice: Revert "Use session storage instead of cookies for site notices" - T199274 (duration: 00m 50s) [11:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:57] T199274: Use localstorage for sitenotices instead of cookies - https://phabricator.wikimedia.org/T199274 [11:20:53] (CR) Jcrespo: "2 comments below" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/457857 (https://phabricator.wikimedia.org/T203394) (owner: Banyek) [11:21:12] so I am left with https://gerrit.wikimedia.org/r/c/457831/ Wikidata: Use new item ID formatter for Q1-Q100000 [11:21:16] poked WMDE in their channel [11:21:23] worse case, I deploy it later in the afternoon [11:21:58] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1090:3317" [mediawiki-config] - https://gerrit.wikimedia.org/r/457871 [11:22:20]
(CR) Hashar: "I have not deployed this change during the scheduled SWAT window. I don't mind deploying it this afternoon though. Just join #wikimedia-o" [mediawiki-config] - https://gerrit.wikimedia.org/r/457831 (https://phabricator.wikimedia.org/T201835) (owner: Aleksey Bekh-Ivanov (WMDE)) [11:23:51] hashar Aleksey should be back in a minute [11:23:53] will tell him [11:24:07] Jonas_WMDE: don't rush him though! We have time :] [11:24:13] going to grab a coffee [11:24:50] Hi! I'm here [11:25:04] hashar: Just gave some feedback: https://phabricator.wikimedia.org/T199274#4555278. [11:25:07] Sorry, got caught in coding [11:26:26] :] [11:26:50] d3r1ck: sounds great! [11:26:57] hashar: Can you deploy my patch? [11:27:00] Aleksey_WMDE: sure thing [11:27:07] are you familiar with mwdebug1001 ? [11:27:08] Awesome! [11:27:11] Sure [11:27:11] (CR) Hashar: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/457831 (https://phabricator.wikimedia.org/T201835) (owner: Aleksey Bekh-Ivanov (WMDE)) [11:27:17] or maybe that is not testable [11:27:31] It is [11:27:53] hashar: So what next, I'm done here? :) [11:28:06] d3r1ck: I guess it is all done :]]] [11:28:19] Okay thank you! [11:28:27] (Merged) jenkins-bot: Wikidata: Use new item ID formatter for Q1-Q100000 [mediawiki-config] - https://gerrit.wikimedia.org/r/457831 (https://phabricator.wikimedia.org/T201835) (owner: Aleksey Bekh-Ivanov (WMDE)) [11:28:32] Aleksey_WMDE: ok patch is on mwdebug1001 [11:28:44] Give me 5 minutes [11:30:09] Looks fine. Good to go [11:30:09] (PS1) Jcrespo: mariadb: Depool db2085 during maintenance to prevent repl errors [mediawiki-config] - https://gerrit.wikimedia.org/r/457872 [11:30:16] hashar: ^ [11:30:30] good [11:30:51] hmm [11:31:00] What?
[11:31:22] ping me when finished, I am creating some errors on codfw due to maintenance [11:31:48] Aleksey_WMDE: sorry I have messed up :/ [11:31:58] I have deployed on mwdebug1001 a different patch [11:32:05] mwdebug1001 now really have your patch [11:32:06] sorry [11:32:16] No problem [11:32:19] jynus: after this patch we are done. So far no issues [11:32:57] (PS1) Giuseppe Lavagetto: Add __init__ for the switchdc/services recipe [cookbooks] - https://gerrit.wikimedia.org/r/457873 [11:32:59] (PS1) Giuseppe Lavagetto: switchdc/services: Add stage 0 [cookbooks] - https://gerrit.wikimedia.org/r/457874 [11:33:01] (PS1) Giuseppe Lavagetto: switchdc/services: add stage 1 [cookbooks] - https://gerrit.wikimedia.org/r/457875 [11:33:03] (PS1) Giuseppe Lavagetto: switchdc/services: add stage 2 [cookbooks] - https://gerrit.wikimedia.org/r/457876 [11:33:22] hashar: All good [11:33:32] \o/ [11:33:35] (CR) jerkins-bot: [V: -1] Add __init__ for the switchdc/services recipe [cookbooks] - https://gerrit.wikimedia.org/r/457873 (owner: Giuseppe Lavagetto) [11:33:37] (CR) jerkins-bot: [V: -1] switchdc/services: Add stage 0 [cookbooks] - https://gerrit.wikimedia.org/r/457874 (owner: Giuseppe Lavagetto) [11:33:45] (CR) jerkins-bot: [V: -1] switchdc/services: add stage 2 [cookbooks] - https://gerrit.wikimedia.org/r/457876 (owner: Giuseppe Lavagetto) [11:33:56] (CR) jerkins-bot: [V: -1] switchdc/services: add stage 1 [cookbooks] - https://gerrit.wikimedia.org/r/457875 (owner: Giuseppe Lavagetto) [11:34:38] !log hashar@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Wikidata: Use new item ID formatter for Q1-Q100000 - T201835 (duration: 00m 49s) [11:34:42] (CR) Jcrespo: [C: -1] "Blocked on maintenance to finish and start replication aftwerwards."
[mediawiki-config] - https://gerrit.wikimedia.org/r/457868 (owner: Jcrespo) [11:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:44] T201835: Use link formatter that uses cache instead of wb_terms for items Q1-Q100.000 - https://phabricator.wikimedia.org/T201835 [11:34:57] (CR) jenkins-bot: Wikidata: Use new item ID formatter for Q1-Q100000 [mediawiki-config] - https://gerrit.wikimedia.org/r/457831 (https://phabricator.wikimedia.org/T201835) (owner: Aleksey Bekh-Ivanov (WMDE)) [11:34:59] (PS2) Jcrespo: mariadb: Depool db2085 during maintenance to prevent repl errors [mediawiki-config] - https://gerrit.wikimedia.org/r/457872 [11:35:23] Aleksey_WMDE: looks good. Congratulations! [11:35:28] !log European SWAT completed [11:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:44] jynus: should be good now. Though I would prefer to let the cluster settle for a few minutes :] [11:35:50] (CR) Jcrespo: [C: 2] mariadb: Depool db2085 during maintenance to prevent repl errors [mediawiki-config] - https://gerrit.wikimedia.org/r/457872 (owner: Jcrespo) [11:35:52] hashar: Thanks! [11:35:58] Will test more [11:40:42] jynus: looks all good.
[11:46:59] !log Preparing train deploy 1.32.0-wmf.20 | T191066 [11:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:04] T191066: 1.32.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T191066 [11:49:38] jouncebot: next [11:49:38] In 0 hour(s) and 10 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180904T1200) [11:50:02] !log Cutting branch 1.32.0-wmf.20 | T191066 [11:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:51] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1090:3317" [mediawiki-config] - https://gerrit.wikimedia.org/r/457871 [11:56:10] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1090:3317" [mediawiki-config] - https://gerrit.wikimedia.org/r/457871 (owner: Marostegui) [11:56:45] (PS5) Banyek: Labs: Add support for socket path and/or port to redact_sanitarium.sh [puppet] - https://gerrit.wikimedia.org/r/457857 (https://phabricator.wikimedia.org/T203394) [11:57:25] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1090:3317" [mediawiki-config] - https://gerrit.wikimedia.org/r/457871 (owner: Marostegui) [11:58:25] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1090:3317 (duration: 00m 48s) [11:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180904T1200) [12:03:03] (PS1) Marostegui: db-eqiad.php: Depool db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/457880 [12:05:10] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/457880 (owner: Marostegui) [12:06:23] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1090:3317" [mediawiki-config] - https://gerrit.wikimedia.org/r/457871 (owner: Marostegui)
[12:06:27] (Merged) jenkins-bot: db-eqiad.php: Depool db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/457880 (owner: Marostegui) [12:06:42] (CR) jenkins-bot: db-eqiad.php: Depool db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/457880 (owner: Marostegui) [12:07:31] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1123 (duration: 00m 48s) [12:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:45] !log Deploy schema change on db1123 [12:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:32] (CR) Bstorm: "Fair enough :)" [puppet] - https://gerrit.wikimedia.org/r/457458 (https://phabricator.wikimedia.org/T172532) (owner: Elukey) [12:13:43] (CR) Gehel: "minor comment inline" (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/457711 (https://phabricator.wikimedia.org/T199079) (owner: Volans) [12:24:38] !log scap prep 1.32.0-wmf.20 # T191066 [12:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:44] T191066: 1.32.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T191066 [12:29:24] Operations, DBA, Epic, Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (Marostegui) I have checked that no codfw hosts have notifications disabled on puppet or on icinga itself.
[12:29:44] (PS2) Bstorm: block_sync: Small improvement to the drbd backup script [puppet] - https://gerrit.wikimedia.org/r/456740 (https://phabricator.wikimedia.org/T171394) [12:31:40] (CR) Bstorm: [C: 2] block_sync: Small improvement to the drbd backup script [puppet] - https://gerrit.wikimedia.org/r/456740 (https://phabricator.wikimedia.org/T171394) (owner: Bstorm) [12:32:38] !log Applied security patches for 1.32.0-wmf.20 | T191066 [12:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:44] T191066: 1.32.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T191066 [12:34:14] (PS1) Hashar: Group 0 to 1.32.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/457888 [12:35:29] !log scap clean 1.32.0-wmf.18 | T191066 [12:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:11] !log hashar@deploy1001 Pruned MediaWiki: 1.32.0-wmf.18 [keeping static files] (duration: 06m 58s) [12:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:40] (PS4) Marostegui: filtered_tables: Remove unused columns [puppet] - https://gerrit.wikimedia.org/r/450934 (https://phabricator.wikimedia.org/T51191) [12:43:00] !log hashar@deploy1001 Started scap: testwiki to php-1.32.0-wmf.20 and rebuild l10n cache - T191066 [12:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:05] T191066: 1.32.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T191066 [12:43:56] (CR) Marostegui: [C: 2] filtered_tables: Remove unused columns [puppet] - https://gerrit.wikimedia.org/r/450934 (https://phabricator.wikimedia.org/T51191) (owner: Marostegui) [12:52:54] (CR) Volans: [C: 2] Add licence and copyright note [cookbooks] - https://gerrit.wikimedia.org/r/457521 (https://phabricator.wikimedia.org/T199079) (owner: Volans) [12:53:36] (Merged) jenkins-bot: Add licence and copyright note
[cookbooks] - https://gerrit.wikimedia.org/r/457521 (https://phabricator.wikimedia.org/T199079) (owner: Volans) [12:57:01] (CR) Marostegui: [C: 1] Labs: Add support for socket path and/or port to redact_sanitarium.sh [puppet] - https://gerrit.wikimedia.org/r/457857 (https://phabricator.wikimedia.org/T203394) (owner: Banyek) [12:57:02] (PS2) Volans: sre.switchdc.mediawiki: improve readability [cookbooks] - https://gerrit.wikimedia.org/r/457519 (https://phabricator.wikimedia.org/T199079) [12:57:31] !log reimaging backup2001 [12:57:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:18] jouncebot: next [12:59:18] In 0 hour(s) and 0 minute(s): MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180904T1300) [13:00:05] hashar: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - European version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180904T1300).
[13:01:41] (CR) Volans: [C: 2] sre.switchdc.mediawiki: improve readability [cookbooks] - https://gerrit.wikimedia.org/r/457519 (https://phabricator.wikimedia.org/T199079) (owner: Volans) [13:02:21] (Merged) jenkins-bot: sre.switchdc.mediawiki: improve readability [cookbooks] - https://gerrit.wikimedia.org/r/457519 (https://phabricator.wikimedia.org/T199079) (owner: Volans) [13:05:42] (CR) Jcrespo: [C: 1] Labs: Add support for socket path and/or port to redact_sanitarium.sh [puppet] - https://gerrit.wikimedia.org/r/457857 (https://phabricator.wikimedia.org/T203394) (owner: Banyek) [13:06:05] !log restart memcached on mc1035 with -v option - T203429 [13:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:10] T203429: Improve memcache logs on mc* hosts - https://phabricator.wikimedia.org/T203429 [13:07:09] (CR) Jcrespo: [C: 2] mariadb: Depool db2085 during maintenance to prevent repl errors [mediawiki-config] - https://gerrit.wikimedia.org/r/457872 (owner: Jcrespo) [13:07:11] !log 1.32.0-wmf.20 is still syncing for testwiki due to l10ncache generation [13:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:21] (PS3) Jcrespo: mariadb: Depool db2085 during maintenance to prevent repl errors [mediawiki-config] - https://gerrit.wikimedia.org/r/457872 [13:07:48] (PS2) Volans: sre.switchdc.mediawiki: make space for new Phase 5 [cookbooks] - https://gerrit.wikimedia.org/r/457858 (https://phabricator.wikimedia.org/T199079) [13:09:01] (CR) Banyek: [C: 2] Labs: Add support for socket path and/or port to redact_sanitarium.sh (1 comment) [puppet] - https://gerrit.wikimedia.org/r/457857 (https://phabricator.wikimedia.org/T203394) (owner: Banyek) [13:09:17] (CR) Volans: [C: 2] sre.switchdc.mediawiki: make space for new Phase 5 [cookbooks] - https://gerrit.wikimedia.org/r/457858 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[13:09:55] (CR) jenkins-bot: mariadb: Depool db2085 during maintenance to prevent repl errors [mediawiki-config] - https://gerrit.wikimedia.org/r/457872 (owner: Jcrespo) [13:09:57] (Merged) jenkins-bot: sre.switchdc.mediawiki: make space for new Phase 5 [cookbooks] - https://gerrit.wikimedia.org/r/457858 (https://phabricator.wikimedia.org/T199079) (owner: Volans) [13:11:22] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.281 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [13:12:22] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [13:13:03] (PS6) Banyek: Labs: Add support for socket path and/or port to redact_sanitarium.sh [puppet] - https://gerrit.wikimedia.org/r/457857 (https://phabricator.wikimedia.org/T203394) [13:13:16] godog: FYI ^^^ (logstash) [13:13:25] (CR) Banyek: [V: 2 C: 2] Labs: Add support for socket path and/or port to redact_sanitarium.sh [puppet] - https://gerrit.wikimedia.org/r/457857 (https://phabricator.wikimedia.org/T203394) (owner: Banyek) [13:13:27] (CR) Jcrespo: "Yes." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/457857 (https://phabricator.wikimedia.org/T203394) (owner: Banyek) [13:13:42] volans: thanks! [13:13:52] checking [13:14:16] Operations, Traffic: certcentral: phantom test failure around challenge success - https://phabricator.wikimedia.org/T203422 (Vgutierrez) I've been working under the assumption that basically our client was too aggressive and some times pebbles wasn't quick enough, but apparently it's getting stuck during... [13:18:02] Operations, ops-codfw, Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 (MoritzMuehlenhoff) I've created a custom Linux 4.14 kernel which worked fine in my tests with an updated firmware-qlogic. I've also created a netboot image based on Linux 4.14. It's bas...
[13:21:38] Operations, ops-codfw, Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 (Papaul) sure
[13:23:37] (CR) Giuseppe Lavagetto: [C: 1] "Tests could be more complete but overall LGTM" (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/457711 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[13:24:43] (PS1) Bstorm: nfs-exportd: remove subtree_check from project exports [puppet] - https://gerrit.wikimedia.org/r/457896 (https://phabricator.wikimedia.org/T203254)
[13:28:58] (CR) Gehel: [C: 1] Add redis_cluster module (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/457711 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[13:29:30] Operations, Scap, Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes on HHVM (rather than ~5 on PHP 5) - https://phabricator.wikimedia.org/T191921 (hashar)
[13:30:32] (PS1) Banyek: Labs: Make redact_sanitarium.sh file easier to read [puppet] - https://gerrit.wikimedia.org/r/457899
[13:31:06] Operations, Scap, Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes on HHVM (rather than ~5 on PHP 5) - https://phabricator.wikimedia.org/T191921 (hashar) I have hit that today. From my dupe task T203458 : rebuildLocalisationCache.php is run with --threads=30...
[13:31:53] (CR) Gehel: sre.switchdc.mediawiki: add Phase 0 cookbook (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/457328 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[13:32:20] (CR) Gehel: [C: 2] Add health check for categories endpoint without lag check [puppet] - https://gerrit.wikimedia.org/r/456187 (owner: Smalyshev)
[13:32:29] (PS9) Gehel: Add health check for categories endpoint without lag check [puppet] - https://gerrit.wikimedia.org/r/456187 (owner: Smalyshev)
[13:34:23] (CR) Giuseppe Lavagetto: [C: 1] sre.switchdc.mediawiki: add Phase 4 cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/456588 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[13:34:31] (CR) Jcrespo: Labs: Make redact_sanitarium.sh file easier to read (1 comment) [puppet] - https://gerrit.wikimedia.org/r/457899 (owner: Banyek)
[13:34:33] (CR) Volans: "Some comments and questions inline" (4 comments) [cookbooks] - https://gerrit.wikimedia.org/r/457873 (owner: Giuseppe Lavagetto)
[13:34:50] (CR) Volans: switchdc/services: Add stage 0 (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/457874 (owner: Giuseppe Lavagetto)
[13:34:57] (CR) Marostegui: Labs: Make redact_sanitarium.sh file easier to read (1 comment) [puppet] - https://gerrit.wikimedia.org/r/457899 (owner: Banyek)
[13:35:03] Operations, Traffic: certcentral: phantom test failure around challenge success - https://phabricator.wikimedia.org/T203422 (Vgutierrez) Same happens with http-01 validation, but in this case pebble output is more helpful cause it's more verbose: ```expected pebble output during http-01 validation Pebble...
[13:35:16] (CR) Volans: [C: -1] "See inline" (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/457875 (owner: Giuseppe Lavagetto)
[13:35:55] (CR) Volans: [C: 1] "LGTM, modulo making CI happy" [cookbooks] - https://gerrit.wikimedia.org/r/457876 (owner: Giuseppe Lavagetto)
[13:37:12] (CR) Giuseppe Lavagetto: [C: 1] sre.switchdc.mediawiki: add Phase 5 cookbook [cookbooks] - https://gerrit.wikimedia.org/r/457859 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[13:38:07] (CR) Muehlenhoff: [C: 1] "Looks good, we don't need subtree checking" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/457896 (https://phabricator.wikimedia.org/T203254) (owner: Bstorm)
[13:40:08] (PS1) Gehel: wdqs: correct argument validation in check categories script [puppet] - https://gerrit.wikimedia.org/r/457901
[13:40:50] (CR) Gehel: [C: 2] wdqs: correct argument validation in check categories script [puppet] - https://gerrit.wikimedia.org/r/457901 (owner: Gehel)
[13:42:20] (CR) Volans: [C: 2] Add redis_cluster module [software/spicerack] - https://gerrit.wikimedia.org/r/457711 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[13:43:20] (Merged) jenkins-bot: Add redis_cluster module [software/spicerack] - https://gerrit.wikimedia.org/r/457711 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[13:44:29] (PS1) Jgreen: add frpig1001 to nsca_frack.cfg.erb [puppet] - https://gerrit.wikimedia.org/r/457903
[13:44:41] (CR) Volans: "Alternative approach proposed inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/457901 (owner: Gehel)
[13:45:25] Is gerrit being unusably slow for anyone else?
[13:45:56] !log hashar@deploy1001 Finished scap: testwiki to php-1.32.0-wmf.20 and rebuild l10n cache - T191066 (duration: 62m 55s)
[13:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:02] T191066: 1.32.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T191066
[13:46:06] (CR) Jgreen: [C: 2] add frpig1001 to nsca_frack.cfg.erb [puppet] - https://gerrit.wikimedia.org/r/457903 (owner: Jgreen)
[13:46:19] hmm just had to refresh the pages, weird :/
[13:48:09] (PS6) Volans: sre.switchdc.mediawiki: add Phase 4 cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/456588 (https://phabricator.wikimedia.org/T199079)
[13:49:35] (PS1) Gehel: wdqs: use the power of argparse to validate the arguments instead of doing it manually [puppet] - https://gerrit.wikimedia.org/r/457906
[13:49:45] volans: your way is much nicer! ^argparse
[13:50:08] (PS2) Bstorm: nfs-exportd: remove subtree_check from project exports [puppet] - https://gerrit.wikimedia.org/r/457896 (https://phabricator.wikimedia.org/T203254)
[13:50:17] (CR) jerkins-bot: [V: -1] wdqs: use the power of argparse to validate the arguments instead of doing it manually [puppet] - https://gerrit.wikimedia.org/r/457906 (owner: Gehel)
[13:50:31] gehel: sorry I didn't thought about it in the first review :)
[13:51:20] (PS2) Gehel: wdqs: use the power of argparse to validate the arguments [puppet] - https://gerrit.wikimedia.org/r/457906
[13:51:57] (CR) Volans: [C: 1] "Ship it!" [puppet] - https://gerrit.wikimedia.org/r/457906 (owner: Gehel)
[13:51:58] volans: better late than never! I realize that my check was actually wrong just after merging :/
[13:52:28] because you negated it ;)
[13:52:40] * volans hides
[13:52:47] (CR) Volans: [C: 2] sre.switchdc.mediawiki: add Phase 4 cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/456588 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[13:52:47] exactly!
[13:53:00] * gehel should follow his own advice some times...
[13:53:30] (Merged) jenkins-bot: sre.switchdc.mediawiki: add Phase 4 cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/456588 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[13:54:06] (PS3) Gehel: wdqs: use the power of argparse to validate the arguments [puppet] - https://gerrit.wikimedia.org/r/457906
[13:55:12] (CR) Gehel: [C: 2] wdqs: use the power of argparse to validate the arguments [puppet] - https://gerrit.wikimedia.org/r/457906 (owner: Gehel)
[13:55:27] (CR) Alex Monk: Packaging stuff and readme (1 comment) [software/certcentral] - https://gerrit.wikimedia.org/r/456646 (owner: Alex Monk)
[13:57:30] (CR) Hashar: [C: 2] "It is train time!" [mediawiki-config] - https://gerrit.wikimedia.org/r/457888 (owner: Hashar)
[13:58:58] (Merged) jenkins-bot: Group 0 to 1.32.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/457888 (owner: Hashar)
[14:00:05] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.32.0-wmf.20
[14:00:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:59] !log stopping replication and running partitioning on logging on db1088:3311 T189107
[14:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:04] T189107: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107
[14:04:21] (PS1) Jcrespo: Revert "mariadb: Depool db2085 during maintenance to prevent repl errors" [mediawiki-config] - https://gerrit.wikimedia.org/r/457911
[14:05:54] (PS3) Bstorm: nfs-exportd: remove subtree_check from project exports [puppet] - https://gerrit.wikimedia.org/r/457896 (https://phabricator.wikimedia.org/T203254)
[14:06:59] (PS3) Muehlenhoff: Remove now obsolete Hiera setting profile::base::enable_microcode [puppet] - https://gerrit.wikimedia.org/r/457532
[14:07:30] (CR) Bstorm: [C: 2] nfs-exportd: remove subtree_check from project exports [puppet] - https://gerrit.wikimedia.org/r/457896 (https://phabricator.wikimedia.org/T203254) (owner: Bstorm)
[14:07:46] jouncebot: now
[14:07:46] For the next 0 hour(s) and 52 minute(s): MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180904T1300)
[14:07:51] Operations, DBA, JADE, TechCom-RFC, Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (Halfak) I think that querying by within-judgement content should be very limited (and probably within the...
[14:08:04] hashar: I'll use the tail end of the train slot to slip https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Wikibase/+/457912/ in then? :)
[14:08:16] +2 :]
[14:08:24] amazing!
[14:09:49] (PS4) Muehlenhoff: Remove now obsolete Hiera setting profile::base::enable_microcode [puppet] - https://gerrit.wikimedia.org/r/457532
[14:10:51] (CR) Muehlenhoff: [C: 2] Remove now obsolete Hiera setting profile::base::enable_microcode [puppet] - https://gerrit.wikimedia.org/r/457532 (owner: Muehlenhoff)
[14:11:43] (PS1) Ema: ATS: pass hostname as an argument to default Lua scripts [puppet] - https://gerrit.wikimedia.org/r/457913 (https://phabricator.wikimedia.org/T199720)
[14:12:16] * addshore goes to get a cup of tea while it merged
[14:13:00] heh, 2 of the patch jobs are actually "queued" in zuul i guess we ran out of executors?
[14:14:19] !log Removing subtree_check from project nfs on labstore1004/5
[14:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:39] (CR) Ema: [C: 2] ATS: pass hostname as an argument to default Lua scripts [puppet] - https://gerrit.wikimedia.org/r/457913 (https://phabricator.wikimedia.org/T199720) (owner: Ema)
[14:15:01] (PS15) Mathew.onipe: Elasticsearch module is coming up. [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079)
[14:15:08] (CR) jerkins-bot: [V: -1] Elasticsearch module is coming up. [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: Mathew.onipe)
[14:17:07] (CR) jenkins-bot: Group 0 to 1.32.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/457888 (owner: Hashar)
[14:19:10] addshore: patch for wmf branches are handled by the zuul pipeline gate-and-submit-swat which has higher precedence
[14:19:42] addshore: but yeah indeed not enough executors. As soon as I got rid of Nodepool I will optimize quibble
[14:20:54] (PS1) Vgutierrez: Fix DNS server input parsing [software/certcentral] - https://gerrit.wikimedia.org/r/457915 (https://phabricator.wikimedia.org/T203422)
[14:22:11] (CR) jerkins-bot: [V: -1] Fix DNS server input parsing [software/certcentral] - https://gerrit.wikimedia.org/r/457915 (https://phabricator.wikimedia.org/T203422) (owner: Vgutierrez)
[14:22:30] *twiddles thumbs waiting for jenkins*
[14:23:38] (PS2) Bmansurov: Enable logging for Schema:CitationUsage at 100% [mediawiki-config] - https://gerrit.wikimedia.org/r/454854 (https://phabricator.wikimedia.org/T191086)
[14:30:45] Operations, Traffic, Patch-For-Review: certcentral: phantom test failure around challenge success - https://phabricator.wikimedia.org/T203422 (Vgutierrez) As mentioned in https://gerrit.wikimedia.org/r/457915, performing a string strip() operation on a bunch of bytes that are actually a DNS query it'...
[14:40:11] hashar: it finally got merged :) I'll sync :)
[14:40:19] addshore: good thanks :)
[14:42:41] Operations, Security-Team: deploy drupal as a GRC CMS for risk and compliance management - https://phabricator.wikimedia.org/T201860 (chasemp)
[14:42:46] Operations, Security-Team: deploy drupal as a GRC CMS for risk and compliance management - https://phabricator.wikimedia.org/T201860 (chasemp) Open>declined
[14:44:09] syncing
[14:44:40] (PS18) Alex Monk: Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646
[14:45:12] (CR) jerkins-bot: [V: -1] Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646 (owner: Alex Monk)
[14:45:15] !log addshore@deploy1001 Synchronized php-1.32.0-wmf.20/extensions/Wikibase: [[gerrit:457912|Track new ItemId formatter usages]] (duration: 01m 17s)
[14:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:34] hashar: all done
[14:49:03] (PS19) Alex Monk: Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646
[14:49:50] (CR) jerkins-bot: [V: -1] Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646 (owner: Alex Monk)
[14:51:44] Operations, Analytics, vm-requests: eqiad (1) - VM request for Piwik/Matomo - https://phabricator.wikimedia.org/T202963 (akosiaris) @elukey: Totally doable disk wise. Network wise, bohrium is at private1-a-eqiad, so no analytics network. Just making sure :D Can I also assume we will be deleting bo...
[14:52:54] Operations, TechCom-RFC, Traffic, Patch-For-Review, and 3 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (Ottomata) > But I'm unsure whether we should also standardise the value format of the header (as UUID). I'd be fine with not re...
[14:53:11] Operations, ops-codfw: mw2213 correctable memory errors - https://phabricator.wikimedia.org/T194172 (MoritzMuehlenhoff) Open>Resolved a:MoritzMuehlenhoff Closing this task, opened T203434 for decom.
[14:55:44] Operations, ORES, Scoring-platform-team, vm-requests: Site: 4 VM request for ORES poolcounter - https://phabricator.wikimedia.org/T203465 (akosiaris)
[14:55:47] (PS20) Alex Monk: Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646
[14:55:56] Operations, ORES, Scoring-platform-team, vm-requests: Site: 4 VM request for ORES poolcounter - https://phabricator.wikimedia.org/T203465 (akosiaris) p:Triage>Normal
[14:56:21] (CR) jerkins-bot: [V: -1] Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646 (owner: Alex Monk)
[14:57:51] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - https://gerrit.wikimedia.org/r/457920
[14:59:22] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - https://gerrit.wikimedia.org/r/457920 (owner: Marostegui)
[14:59:31] (PS21) Alex Monk: Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646
[14:59:58] (CR) jerkins-bot: [V: -1] Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646 (owner: Alex Monk)
[15:00:58] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - https://gerrit.wikimedia.org/r/457920 (owner: Marostegui)
[15:01:10] (PS22) Alex Monk: Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646
[15:01:41] Operations, Mail, Phabricator, Patch-For-Review, and 3 others: Phabricator outbound email seems to have a SPOF of mx1001 - https://phabricator.wikimedia.org/T196916 (herron) >>! In T196916#4552742, @ArielGlenn wrote: > Does this need more review/commentary before moving forward? We should be in...
[15:02:03] (CR) jerkins-bot: [V: -1] Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646 (owner: Alex Monk)
[15:02:05] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1123 (duration: 00m 57s)
[15:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:14] !log Deploy schema change on db1095:3313
[15:02:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:34] (CR) Herron: "fyi planning to merge this tomorrow (9/5)" [puppet] - https://gerrit.wikimedia.org/r/440910 (https://phabricator.wikimedia.org/T196916) (owner: Herron)
[15:03:18] (PS23) Alex Monk: Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646
[15:04:10] (PS1) Marostegui: db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/457922
[15:04:40] (CR) jerkins-bot: [V: -1] Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646 (owner: Alex Monk)
[15:04:43] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - https://gerrit.wikimedia.org/r/457920 (owner: Marostegui)
[15:05:04] (CR) Thcipriani: [C: 1] "Looks good to me!" (3 comments) [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/398462 (owner: Hashar)
[15:05:53] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/457922 (owner: Marostegui)
[15:06:45] (Merged) jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/457922 (owner: Marostegui)
[15:06:48] (PS1) Papaul: DHCP: Change MAC address to test OS install on 10GB NIC [puppet] - https://gerrit.wikimedia.org/r/457923
[15:07:35] (CR) jerkins-bot: [V: -1] DHCP: Change MAC address to test OS install on 10GB NIC [puppet] - https://gerrit.wikimedia.org/r/457923 (owner: Papaul)
[15:07:58] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1078 (duration: 00m 56s)
[15:08:00] !log Deploy schema change on db1078
[15:08:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:13] (PS24) Alex Monk: Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646
[15:09:32] (CR) jerkins-bot: [V: -1] Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646 (owner: Alex Monk)
[15:10:49] (PS25) Alex Monk: Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646
[15:12:23] (CR) jerkins-bot: [V: -1] Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646 (owner: Alex Monk)
[15:12:32] (CR) Giuseppe Lavagetto: [C: 1] mediawiki: improve stop_cronjobs() method [software/spicerack] - https://gerrit.wikimedia.org/r/457367 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[15:13:09] !log shutting down backup2001 to disable 1GB NIC
[15:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:14:29] (PS26) Alex Monk: Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646
[15:14:31] (PS2) Volans: mediawiki: improve stop_cronjobs() method [software/spicerack] - https://gerrit.wikimedia.org/r/457367 (https://phabricator.wikimedia.org/T199079)
[15:16:46] (CR) Volans: [C: 2] mediawiki: improve stop_cronjobs() method [software/spicerack] - https://gerrit.wikimedia.org/r/457367 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[15:16:57] PROBLEM - Host backup2001 is DOWN: PING CRITICAL - Packet loss = 100%
[15:18:10] (PS1) Alexandros Kosiaris: Introduce orespoolcounter{1,2}00{1,2} [dns] - https://gerrit.wikimedia.org/r/457925 (https://phabricator.wikimedia.org/T203465)
[15:18:24] (Merged) jenkins-bot: mediawiki: improve stop_cronjobs() method [software/spicerack] - https://gerrit.wikimedia.org/r/457367 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[15:18:31] moritzm: expired downtime for backup2001?
[15:20:05] Operations, ops-eqdfw: unrack/decom cr1-eqdfw - https://phabricator.wikimedia.org/T202700 (Papaul) @ayounsi I am planning on going to eqdfw this Thursday at 10:30am CDT
[15:20:45] (CR) jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/457922 (owner: Marostegui)
[15:21:39] (PS2) Volans: sre.switchdc.mediawiki: add Phase 5 cookbook [cookbooks] - https://gerrit.wikimedia.org/r/457859 (https://phabricator.wikimedia.org/T199079)
[15:22:44] (CR) Volans: [C: 2] sre.switchdc.mediawiki: add Phase 5 cookbook [cookbooks] - https://gerrit.wikimedia.org/r/457859 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[15:23:26] (Merged) jenkins-bot: sre.switchdc.mediawiki: add Phase 5 cookbook [cookbooks] - https://gerrit.wikimedia.org/r/457859 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[15:24:09] Operations: syncing Ubuntu mirror fail - https://phabricator.wikimedia.org/T203290 (Dzahn) Open>Resolved a:Dzahn 18:15 <@hloeung> mutante, tomreyn: fixing now. Broke when we added 2 more servers to rsync.archive.u.c record to increase the available rsync slots 11:23 < mutante> hloeung: thank yo...
[15:26:19] (PS27) Alex Monk: Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646
[15:26:28] (CR) Giuseppe Lavagetto: [C: 1] spicerack: add redis sessions configuration [puppet] - https://gerrit.wikimedia.org/r/457836 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[15:26:57] (PS3) Volans: spicerack: add redis sessions configuration [puppet] - https://gerrit.wikimedia.org/r/457836 (https://phabricator.wikimedia.org/T199079)
[15:27:18] (CR) Giuseppe Lavagetto: [C: -1] "Fix permissions on the directory; else, LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/457836 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[15:27:46] (CR) jerkins-bot: [V: -1] Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646 (owner: Alex Monk)
[15:27:46] oops right
[15:28:17] (CR) Alex Monk: "recheck" [software/certcentral] - https://gerrit.wikimedia.org/r/456646 (owner: Alex Monk)
[15:28:19] (PS2) Dzahn: DHCP: Change MAC address to test OS install on 10GB NIC [puppet] - https://gerrit.wikimedia.org/r/457923 (https://phabricator.wikimedia.org/T196477) (owner: Papaul)
[15:28:59] (Abandoned) Ottomata: Initial debian packaging version 0.208 [debs/presto] - https://gerrit.wikimedia.org/r/456394 (https://phabricator.wikimedia.org/T203115) (owner: Ottomata)
[15:29:18] (CR) Dzahn: [C: 2] DHCP: Change MAC address to test OS install on 10GB NIC [puppet] - https://gerrit.wikimedia.org/r/457923 (https://phabricator.wikimedia.org/T196477) (owner: Papaul)
[15:30:05] (CR) jerkins-bot: [V: -1] Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646 (owner: Alex Monk)
[15:30:05] !log reboot aqs1004 after running fsck on root (was read-only)
[15:30:09] (PS1) Ema: 0.1.2: Consider 404 responses as valid [software/varnish/vhtcpd] - https://gerrit.wikimedia.org/r/457927 (https://phabricator.wikimedia.org/T199720)
[15:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:16] (CR) jerkins-bot: [V: -1] 0.1.2: Consider 404 responses as valid [software/varnish/vhtcpd] - https://gerrit.wikimedia.org/r/457927 (https://phabricator.wikimedia.org/T199720) (owner: Ema)
[15:30:19] (CR) Alex Monk: "recheck" [software/certcentral] - https://gerrit.wikimedia.org/r/456646 (owner: Alex Monk)
[15:31:43] (PS2) Banyek: Labs: Make redact_sanitarium.sh file easier to read [puppet] - https://gerrit.wikimedia.org/r/457899
[15:32:46] (PS2) Ema: 0.1.2: Consider 404 responses as valid [software/varnish/vhtcpd] - https://gerrit.wikimedia.org/r/457927 (https://phabricator.wikimedia.org/T199720)
[15:32:53] (CR) jerkins-bot: [V: -1] 0.1.2: Consider 404 responses as valid [software/varnish/vhtcpd] - https://gerrit.wikimedia.org/r/457927 (https://phabricator.wikimedia.org/T199720) (owner: Ema)
[15:35:06] Operations, Analytics, vm-requests: eqiad (1) - VM request for Piwik/Matomo - https://phabricator.wikimedia.org/T202963 (elukey) >>! In T202963#4556071, @akosiaris wrote: > @elukey: Totally doable disk wise. > > Network wise, bohrium is at private1-a-eqiad, so no analytics network. Just making sure...
[15:37:32] volans: yep, acking it
[15:38:10] ok
[15:38:12] thanks
[15:40:24] (PS2) Alex Monk: Fix DNS server input parsing [software/certcentral] - https://gerrit.wikimedia.org/r/457915 (https://phabricator.wikimedia.org/T203422) (owner: Vgutierrez)
[15:40:26] (PS28) Alex Monk: Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646
[15:40:28] (PS2) Alex Monk: Rename certcentral_api to just api [software/certcentral] - https://gerrit.wikimedia.org/r/457378 (https://phabricator.wikimedia.org/T199711) (owner: Vgutierrez)
[15:40:30] (PS2) Alex Monk: README: provide configuration file examples [software/certcentral] - https://gerrit.wikimedia.org/r/457485 (https://phabricator.wikimedia.org/T199711) (owner: Vgutierrez)
[15:40:58] RECOVERY - Host backup2001 is UP: PING WARNING - Packet loss = 37%, RTA = 36.20 ms
[15:41:39] (CR) Alex Monk: [C: 2] Fix DNS server input parsing [software/certcentral] - https://gerrit.wikimedia.org/r/457915 (https://phabricator.wikimedia.org/T203422) (owner: Vgutierrez)
[15:42:30] (CR) Alex Monk: [C: 2] Rename certcentral_api to just api [software/certcentral] - https://gerrit.wikimedia.org/r/457378 (https://phabricator.wikimedia.org/T199711) (owner: Vgutierrez)
[15:43:10] (Merged) jenkins-bot: Fix DNS server input parsing [software/certcentral] - https://gerrit.wikimedia.org/r/457915 (https://phabricator.wikimedia.org/T203422) (owner: Vgutierrez)
[15:44:32] (PS4) Volans: spicerack: add redis sessions configuration [puppet] - https://gerrit.wikimedia.org/r/457836 (https://phabricator.wikimedia.org/T199079)
[15:44:34] (CR) jenkins-bot: Fix DNS server input parsing [software/certcentral] - https://gerrit.wikimedia.org/r/457915 (https://phabricator.wikimedia.org/T203422) (owner: Vgutierrez)
[15:46:51] Operations, ops-codfw, Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 (Papaul) @MoritzMuehlenhoff here is what i get {F25650505}
[15:50:22] (CR) Alexandros Kosiaris: [C: 1] Display etcd /mediawiki-config values in noc.w.o [puppet] - https://gerrit.wikimedia.org/r/455578 (owner: Alexandros Kosiaris)
[15:50:43] (PS1) Muehlenhoff: Test 4.14 netboot image for backup2001 [puppet] - https://gerrit.wikimedia.org/r/457930
[15:50:51] (CR) Alexandros Kosiaris: [C: 1] conftool: add class for writing to state to file [puppet] - https://gerrit.wikimedia.org/r/457490 (owner: Giuseppe Lavagetto)
[15:51:35] (CR) Alexandros Kosiaris: [C: 1] realm.pp: drop mw_primary [puppet] - https://gerrit.wikimedia.org/r/457491 (owner: Giuseppe Lavagetto)
[15:53:31] Operations, monitoring, security-team-backlog: icinga notification if elevated writing to badpass.log - https://phabricator.wikimedia.org/T150300 (Bawolff)
[15:53:45] Operations, ops-codfw, Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 (MoritzMuehlenhoff) @papaul: That's expected, this also need a change to the DHCP config to use the netboot image based on 4.14, e.g. by using the patch at https://gerrit.wikimedia.org/r...
[15:54:32] Operations, ops-codfw, ops-eqiad, netops: Audit switch ports/descriptions/enable - https://phabricator.wikimedia.org/T189519 (ayounsi) Open>Resolved Yep, all good! Thanks! Alert created with this runbook: https://wikitech.wikimedia.org/wiki/Network_monitoring#Port_with_no_description_on_...
[16:00:05] godog and _joe_: Time to snap out of that daydream and deploy Puppet SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180904T1600).
[16:00:05] Ebe123, Reedy, and thcipriani: A patch you scheduled for Puppet SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:01:49] (CR) Ema: [V: 2 C: 2] 0.1.2: Consider 404 responses as valid [software/varnish/vhtcpd] - https://gerrit.wikimedia.org/r/457927 (https://phabricator.wikimedia.org/T199720) (owner: Ema)
[16:02:22] Operations, Traffic, Patch-For-Review: certcentral: phantom test failure around challenge success - https://phabricator.wikimedia.org/T203422 (Krenair) Open>Resolved
[16:02:24] Operations, Traffic, Goal, Patch-For-Review: Deploy a scalable service for ACME (LetsEncrypt) certificate management - https://phabricator.wikimedia.org/T199711 (Krenair)
[16:03:11] o/
[16:03:13] Operations, Puppet, Release-Engineering-Team, puppet-compiler, and 2 others: Upgrade Puppet compilers to Stretch - https://phabricator.wikimedia.org/T191438 (herron) The config to proxy https://puppet-compiler.wmflabs.org across both jessie and stretch compiler hosts (4 hosts total) is live and h...
[16:03:53] I'm taking a look
[16:04:33] Reedy: https://gerrit.wikimedia.org/r/c/operations/puppet/+/445604 says needs reworking, I'll skip it for puppet swat for now
[16:04:37] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[16:05:09] Operations, Wikimedia-General-or-Unknown, security-team-backlog, WorkType-NewFunctionality: security@mediawiki.org : Create a public key and publish it on the public key servers - https://phabricator.wikimedia.org/T40860 (chasemp)
[16:06:00] (PS4) Filippo Giunchedi: Scap: update logstash_checker.py mwdeploy query [puppet] - https://gerrit.wikimedia.org/r/449639 (owner: Thcipriani)
[16:06:08] (CR) Filippo Giunchedi: [C: 2] Scap: update logstash_checker.py mwdeploy query [puppet] - https://gerrit.wikimedia.org/r/449639 (owner: Thcipriani)
[16:06:59] (PS29) Alex Monk: Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646
[16:07:01] (PS1) Alex Monk: Add make_account CLI script [software/certcentral] - https://gerrit.wikimedia.org/r/457933
[16:08:35] !log fdans@deploy1001 Started deploy [analytics/refinery@2c4ec7a]: deploying refinery to update pageview def
[16:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:09:16] thcipriani: merged, ran puppet on deploy hosts
[16:09:27] thcipriani: I'm ok to do the scap upgrade now btw
[16:09:31] (PS2) Jcrespo: Revert "mariadb: Depool db2085 during maintenance to prevent repl errors" [mediawiki-config] - https://gerrit.wikimedia.org/r/457911
[16:09:38] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[16:09:47] godog: ok, I'll check that logstash_checker is working as expected
[16:10:18] thcipriani: kk
[16:11:19] (CR) Jcrespo: [C: 2] Revert "mariadb: Depool db2085 during maintenance to prevent repl errors" [mediawiki-config] - https://gerrit.wikimedia.org/r/457911 (owner: Jcrespo)
[16:11:37] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 21 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[16:11:43] godog: logstash_checker.py update checked and working well! thanks for that merge.
[16:11:47] (03PS1) 10Ema: vhtcpd (0.1.2-1) stretch-wikimedia; urgency=medium [software/varnish/vhtcpd] (debian) - 10https://gerrit.wikimedia.org/r/457934 (https://phabricator.wikimedia.org/T199720) [16:11:51] (03CR) 10Vgutierrez: Packaging stuff and readme (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/456646 (owner: 10Alex Monk) [16:12:38] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db2085 during maintenance to prevent repl errors" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457911 (owner: 10Jcrespo) [16:13:45] thcipriani: sweet, I'm uploading scap now [16:14:05] 10Operations, 10security-team-backlog, 10Release-Engineering-Team (Someday), 10User-greg: Determine a core set or a checklist of permissions for deployment purpose - https://phabricator.wikimedia.org/T140270 (10chasemp) [16:14:17] godog: great! I can check whenever that's on the deployment hosts (that's the only place it should matter for this update) [16:15:50] (03PS1) 10Volans: sre.switchdc.mediawiki: wait TTL expiration [cookbooks] - 10https://gerrit.wikimedia.org/r/457936 (https://phabricator.wikimedia.org/T199079) [16:16:15] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 (10Papaul) @MoritzMuehlenhoff ok I will change the install in DHCP [16:16:29] (03PS30) 10Alex Monk: Packaging stuff and readme [software/certcentral] - 10https://gerrit.wikimedia.org/r/456646 [16:16:38] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 17 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [16:16:40] (03PS1) 10Filippo Giunchedi: scap: upgrade to 3.8.5-1 [puppet] - 10https://gerrit.wikimedia.org/r/457938 (https://phabricator.wikimedia.org/T203271) [16:17:18] (03PS1) 10Dzahn: install_server: let backup2001 use stretch414-installer [puppet] - 10https://gerrit.wikimedia.org/r/457939 (https://phabricator.wikimedia.org/T196477) [16:17:31] (03CR) 10Filippo 
Giunchedi: [C: 032] scap: upgrade to 3.8.5-1 [puppet] - 10https://gerrit.wikimedia.org/r/457938 (https://phabricator.wikimedia.org/T203271) (owner: 10Filippo Giunchedi) [16:17:41] (03PS2) 10Filippo Giunchedi: scap: upgrade to 3.8.5-1 [puppet] - 10https://gerrit.wikimedia.org/r/457938 (https://phabricator.wikimedia.org/T203271) [16:18:02] (03CR) 10jerkins-bot: [V: 04-1] Packaging stuff and readme [software/certcentral] - 10https://gerrit.wikimedia.org/r/456646 (owner: 10Alex Monk) [16:18:32] !log fdans@deploy1001 Finished deploy [analytics/refinery@2c4ec7a]: deploying refinery to update pageview def (duration: 09m 57s) [16:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:14] (03CR) 10Dzahn: [C: 032] install_server: let backup2001 use stretch414-installer [puppet] - 10https://gerrit.wikimedia.org/r/457939 (https://phabricator.wikimedia.org/T196477) (owner: 10Dzahn) [16:19:22] (03PS2) 10Dzahn: install_server: let backup2001 use stretch414-installer [puppet] - 10https://gerrit.wikimedia.org/r/457939 (https://phabricator.wikimedia.org/T196477) [16:20:45] 10Operations, 10Security: Use user-specific passwords for accessing EventLogging database - https://phabricator.wikimedia.org/T120532 (10Bawolff) [16:20:55] thcipriani: upgraded on deploy1001 [16:21:14] !log upload scap 3.8.5-1 - T203271 [16:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:19] T203271: Update Debian Package for Scap to 3.8.5-1 - https://phabricator.wikimedia.org/T203271 [16:21:20] godog: cool, I'll run a test sync-wikiversions [16:21:58] (03PS31) 10Alex Monk: Packaging stuff and readme [software/certcentral] - 10https://gerrit.wikimedia.org/r/456646 [16:22:04] (03PS3) 10Dzahn: install_server: let backup2001 use stretch414-installer [puppet] - 10https://gerrit.wikimedia.org/r/457939 (https://phabricator.wikimedia.org/T196477) [16:22:14] (03CR) 10Dzahn: [V: 032 C: 032] install_server: let backup2001 use stretch414-installer 
[puppet] - 10https://gerrit.wikimedia.org/r/457939 (https://phabricator.wikimedia.org/T196477) (owner: 10Dzahn) [16:23:10] (03PS2) 10Dzahn: planet: Add to en [puppet] - 10https://gerrit.wikimedia.org/r/457225 (owner: 10Legoktm) [16:24:14] (03CR) 10Dzahn: [C: 032] planet: Add to en [puppet] - 10https://gerrit.wikimedia.org/r/457225 (owner: 10Legoktm) [16:24:17] (03PS2) 10Alex Monk: Add make_account CLI script [software/certcentral] - 10https://gerrit.wikimedia.org/r/457933 [16:24:26] !log thcipriani@deploy1001 rebuilt and synchronized wikiversions files: noop wikiversions sync for T198640 [16:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:32] T198640: Perform scap canary checks after sync-wikiversions - https://phabricator.wikimedia.org/T198640 [16:24:48] (03CR) 10jenkins-bot: Revert "mariadb: Depool db2085 during maintenance to prevent repl errors" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457911 (owner: 10Jcrespo) [16:25:08] I'll do a sync-file test, too for good measure [16:25:09] (03CR) 10BBlack: [C: 031] vhtcpd (0.1.2-1) stretch-wikimedia; urgency=medium [software/varnish/vhtcpd] (debian) - 10https://gerrit.wikimedia.org/r/457934 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [16:25:55] (03CR) 10jerkins-bot: [V: 04-1] Add make_account CLI script [software/certcentral] - 10https://gerrit.wikimedia.org/r/457933 (owner: 10Alex Monk) [16:26:54] !log thcipriani@deploy1001 Synchronized README: noop sync file - test scap 3.8.5-1 (duration: 00m 54s) [16:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:05] godog: new scap version looks good! [16:27:14] thank you for the upgrade! [16:28:34] 10Operations, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Update Debian Package for Scap to 3.8.5-1 - https://phabricator.wikimedia.org/T203271 (10thcipriani) 05Open>03Resolved All seems working. Thank you @fgiunchedi ! 
[16:28:38] marlier: o/ [16:28:58] elukey: what's up? [16:29:06] thcipriani: np [16:29:35] marlier: there are several icinga alarms in critical state from some days ago, some of them related to (from what I can get) performance degradations [16:29:49] I am wondering if these are know and can be "acked" or if they need investigation [16:30:56] (03PS3) 10Alex Monk: Add make_account CLI script [software/certcentral] - 10https://gerrit.wikimedia.org/r/457933 [16:31:31] (03PS3) 10Dzahn: noc: Add Cache-Control with short max-age for noc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/456206 (https://phabricator.wikimedia.org/T202734) (owner: 10Krinkle) [16:31:34] (03PS1) 10Volans: sre.switchdc.mediawiki: add validation of CLI args [cookbooks] - 10https://gerrit.wikimedia.org/r/457943 (https://phabricator.wikimedia.org/T199079) [16:31:58] elukey: I tried :-( https://usercontent.irccloud-cdn.com/file/lOCNpuhT/Screen%20Shot%202018-09-04%20at%2012.31.22%20PM.png [16:31:59] !log reboot aqs1004 again for kernel + openjdk-8 upgrades (now available since the root partition is not read only anymore) [16:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:22] marlier: I can do it if you want, is there a task or something? [16:32:50] let's make one to fix the permission issue, i can help [16:33:19] (03PS5) 10Volans: sre.switchdc.mediawiki: add Phase 3 cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/456510 (https://phabricator.wikimedia.org/T199079) [16:33:22] (03PS1) 10Volans: sre.switchdc.mediawiki: add --live-test option [cookbooks] - 10https://gerrit.wikimedia.org/r/457944 (https://phabricator.wikimedia.org/T199079) [16:33:46] mutante: yeah I was trying to check what perms where needed, likely LDAP ones? 
He is already in WMF [16:33:51] (03PS2) 10Herron: iegreview: change smtp_host to localhost [puppet] - 10https://gerrit.wikimedia.org/r/441131 (https://phabricator.wikimedia.org/T196920) [16:34:02] elukey: A couple - https://phabricator.wikimedia.org/T202703 and https://phabricator.wikimedia.org/T202703 [16:34:15] elukey: no, it's not related to an LDAP group in this case [16:34:23] without that he wouldnt be able to login [16:34:35] this part is that the logged-in user doesn't match the icinga contact name [16:34:48] (03CR) 10Herron: [C: 032] iegreview: change smtp_host to localhost [puppet] - 10https://gerrit.wikimedia.org/r/441131 (https://phabricator.wikimedia.org/T196920) (owner: 10Herron) [16:35:03] (capitalization) or that the icinga contact isnt a contact for these services [16:35:22] or alternatively that we dont give global permisisons for all checks to perf team [16:36:12] the best fix is to make sure that a perf-team contact group is contact for all these checks, and not just "admins" [16:37:35] elukey: Regarding icinga alarms, I thought it was decided last year that we do not use icinga ack for Grafana alarms? [16:38:26] mutante: all right super ignorant about this, thanks :) [16:38:27] elukey: also note that it is a boolean system, so "critical" is the only state besides "okay". [16:38:50] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 (10Papaul) Here what I get now {F25650699} [16:38:57] Krinkle: not aware of past decisions, I was just checking icinga and saw those criticals, so I asked around :) [16:39:16] Krinkle: it feels weird to leave those at critical state though, an ack is always good imho [16:39:18] elukey: Yeah, I understand. I guess there is a system overview dashboard in Icgina for ops? [16:39:31] elukey: Is there a way to "just" hide Grafana from that? 
[16:39:46] having a bunch of "unhandled" CRITs at all times is definitely something to avoid [16:39:49] multitenancy and icinga don't go well together Krinkle ;) [16:39:57] so we like to ACK them because that makes them handled [16:40:18] Icinga fully supports multi-tenancy [16:40:22] Krinkle: the other option is to downtime them for a given amount of time [16:40:23] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 (10Papaul) @MoritzMuehlenhoff i will leave it to you so you can play with it tomorrow [16:40:24] right now they appear as "critical" alarms for a host called "einsteinium". [16:40:34] The same way a disk crash would appear on that physical host. [16:40:51] our puppet code just needs to add the right contact groups to fix this [16:41:17] 10Operations, 10security-team-backlog, 10Security: Password Vault for Security Team - https://phabricator.wikimedia.org/T185236 (10Bawolff) [16:42:05] Krinkle: yes, "hiding" them is the same as ACK [16:42:11] that would allow us to ack them, but note that we would be unlikely to do so given we don't look in icinga to find them. I don't want to be ignorant, but I think in practice we wouldn't log in until you point out there's an unhandled crit :/ - we may need to change our workflow, if that's best, I'm open to that. [16:42:17] it removes them from the "unhandled' column [16:42:52] Krinkle: ? you also dont read the email it sends though? [16:43:25] mutante: I do. We now group them by dashboard (matching subject names), and I star ones that have not recovered yet, to address/remember to look at. [16:44:07] Right now three of them are regressed because of a CentralNotice banner making the page load process longer. [16:44:10] That's "normal". [16:44:18] We can't ack that from the Grafana side short of disabling it. [16:44:34] So we tend to leave it until next week it goes back to normal and/or the baseline adjusts. 
[16:44:35] when you "star them", where does that happen? [16:44:39] Gmail [16:44:57] the issue in the workflow seems to be that we are using 2 separate places to label it [16:45:10] the meaning is the same whether you star them or we ack them [16:45:41] our only use for Icgina is to be an SMTP proxy from Grafana [16:45:54] because, apologies, ops vetoed against sending e-mail from Grafana [16:46:09] I might misremember... [16:46:40] ok, sorry, i dont know anything about these past decisions to not use Icinga and not send mail from grafana. seems to be an issue though [16:47:17] the purpose of Icinga web ui is kind of to do this thing and handle the alerts [16:47:43] many of our metric alerts are not actionable directly. So it's not really comparable to something like "disk full" where something has to happen. [16:47:47] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [16:50:05] mutante: the most common alerts we have is %-difference of page load time compared to last week. Those fire sometimes because Amazon performs an update, and makes the baseline shift. So those will be in warn-state for a week until "last week" is this week. It only notifies us on IRC and by e-mail once, so there's nothing else that needs to happen really. [16:50:17] it sounds like maybe they should not be in Icinga then and just emails, but i dont know why that was a problem in the past [16:50:47] because it means we'll have two monitoring systems capable of paging via IRC, E-mail and other methods. [16:51:53] For example, ops use of Grafana for mediawiki-fatals/exceptions is quite fitting, and presumably you'd want to keep that as-is. Or maybe that's a Graphite check in Icgina with a Grafana url in its description. Hm.. 
[16:52:48] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 17 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [16:53:09] we could also change them to never be CRIT and always just WARN [16:53:24] mutante: Is there an an easy way to ack by e-mail? Like some kind of command parser? [16:53:33] mutante: Yeah, that'd be fine I think. [16:53:34] if by definition they are not actioanable then they should probably never be called CRIT [16:53:53] also that means they dont show up on IRC anymore.. as CRITs they do [16:54:05] Oh, but they should show in IRC :/ [16:54:13] Krinkle: not by email, but per ssh [16:54:14] #wikimedia-perf-bots :) [16:54:49] Anyway, I'll file a task and let us think about it some more. [16:54:58] ok [16:57:36] please add me in CC so I can follow and avoid bothering people when we decide what to do :) [17:00:05] cscott, arlolra, subbu, halfak, and Amir1: How many deployers does it take to do Services – Graphoid / Parsoid / Citoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180904T1700). [17:01:09] ah Krinkle, completely unrelated to the icinga thing, as FYI I restarted memcached on mc1035 with the '-v' option (and disabled puppet0 [17:01:55] (more info in T203429) [17:01:56] T203429: Improve memcache logs on mc* hosts - https://phabricator.wikimedia.org/T203429 [17:02:14] the downside is of course that one shard was wiped and it is being repopulated now [17:03:20] so far, the only "extra" log that I got is "Sep 04 13:09:20 mc1035 memcached[12681]: Failed to write, and not due to blocking: Broken pipe" [17:03:49] but it would be good anyway, in my opinion, to have these kind of error/warnings as debug info [17:04:08] (03PS1) 10Niharika29: Deploy TemplateWizard to test and test2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457950 (https://phabricator.wikimedia.org/T202545) [17:09:48] elukey: I noticed the puppet patch. 
I know very little about memcached's internal operation and haven't debugged it directly. [17:09:52] 10Operations, 10Scap, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes on HHVM (rather than ~5 on PHP 5) - https://phabricator.wikimedia.org/T191921 (10hashar) on deployment-deploy01: ``` mwscript rebuildLocalisationCache.php --wiki="testwiki" --outdir="/tmp/scap_l1... [17:10:53] 10Operations, 10Wikimedia-General-or-Unknown, 10Patch-For-Review, 10User-Elukey: Improve memcache logs on mc* hosts - https://phabricator.wikimedia.org/T203429 (10Krinkle) [17:11:54] (03CR) 10Dzahn: [C: 032] noc: Add Cache-Control with short max-age for noc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/456206 (https://phabricator.wikimedia.org/T202734) (owner: 10Krinkle) [17:11:59] 10Operations, 10Wikimedia-General-or-Unknown, 10Patch-For-Review, 10User-Elukey: Improve memcache logs on mc* hosts - https://phabricator.wikimedia.org/T203429 (10Krinkle) Retagging as MediaWiki-Cache is about the PHP code inside MediaWiki core about caching (the the Memcached client, and the BagOStuff and... [17:12:01] (03PS4) 10Dzahn: noc: Add Cache-Control with short max-age for noc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/456206 (https://phabricator.wikimedia.org/T202734) (owner: 10Krinkle) [17:12:05] Krinkle: I don't have a lot of info either, I started looking to mc1035 when I saw the occasional mw exceptions for memcached, and realized that the current systemd unit does not care about any log file [17:12:45] Then I saw all the conversation between you and Joe about the current issue [17:12:51] :) [17:13:10] 10Operations, 10Scap, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes on HHVM (rather than ~5 on PHP 5) - https://phabricator.wikimedia.org/T191921 (10hashar) Then I am using taskset to find the CPU affinity mask. The maintenance script is tied to CPU 1: ``` 16735... 
[17:13:10] shall we open a task to track the current status? [17:20:18] (03PS1) 10Giuseppe Lavagetto: dnsdisc: add methods for checking if a datacenter can be depooled [software/spicerack] - 10https://gerrit.wikimedia.org/r/457951 [17:20:45] elukey: I.. don't remember. Yes, a task sounds good. [17:22:49] (03CR) 10Gehel: Elasticsearch module is coming up. (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: 10Mathew.onipe) [17:23:27] 10Operations, 10Discovery-Search (Current work): Migrate elasticsearch scripts to spicerack cookbooks - https://phabricator.wikimedia.org/T202885 (10Gehel) a:03Mathew.onipe [17:24:14] (03CR) 10Giuseppe Lavagetto: switchdc/services: Add stage 0 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/457874 (owner: 10Giuseppe Lavagetto) [17:25:38] (03CR) 10Giuseppe Lavagetto: switchdc/services: add stage 1 (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/457875 (owner: 10Giuseppe Lavagetto) [17:35:17] 10Puppet, 10Cloud-VPS, 10cloud-services-team: wmcs-roots group not granted access to new eqiad1 region bare metal servers - https://phabricator.wikimedia.org/T203488 (10bd808) [17:35:54] 10Operations, 10cloud-services-team: Onboard gtirloni to WMF - https://phabricator.wikimedia.org/T203489 (10Andrew) [17:36:28] 10Operations, 10cloud-services-team: Onboard gtirloni to WMF - https://phabricator.wikimedia.org/T203489 (10Andrew) a:03Andrew [17:38:00] 10Puppet, 10Cloud-VPS, 10cloud-services-team: wmcs-roots group not granted access to new eqiad1 region bare metal servers - https://phabricator.wikimedia.org/T203488 (10Andrew) a:03Andrew [17:39:15] 10Operations, 10Cloud-VPS, 10cloud-services-team, 10Discovery-Search (Current work): rack/setup/install cloudelastic100[1-4].eqiad.wmnet systems - https://phabricator.wikimedia.org/T194186 (10Gehel) [17:39:17] (03CR) 10Volans: "Looks mostly good to me, just one comment and few nitpicks (mostly optional, 
although I think might slightly improve it)" (037 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/457951 (owner: 10Giuseppe Lavagetto) [17:39:58] (03PS2) 10Giuseppe Lavagetto: sre.switchdc.services: Add __init__ for the recipe [cookbooks] - 10https://gerrit.wikimedia.org/r/457873 (https://phabricator.wikimedia.org/T199079) [17:40:01] (03PS2) 10Giuseppe Lavagetto: sre.switchdc.services: Add phase 0 [cookbooks] - 10https://gerrit.wikimedia.org/r/457874 (https://phabricator.wikimedia.org/T199079) [17:40:03] (03PS2) 10Giuseppe Lavagetto: sre.switchdc.services: add phase 1 [cookbooks] - 10https://gerrit.wikimedia.org/r/457875 (https://phabricator.wikimedia.org/T199079) [17:40:04] (03PS2) 10Giuseppe Lavagetto: sre.switchdc.services: add phase 2 [cookbooks] - 10https://gerrit.wikimedia.org/r/457876 (https://phabricator.wikimedia.org/T199079) [17:43:08] RECOVERY - Memory correctable errors -EDAC- on wtp2020 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2020&var-datasource=codfw%2520prometheus%252Fops [17:43:12] _joe_: this can be abandoned now if you want something out of your queue https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/383519/ [17:44:24] (03PS1) 10Andrew Bogott: wmcs: add role defines for things in the eqiad1 deploy [puppet] - 10https://gerrit.wikimedia.org/r/457954 (https://phabricator.wikimedia.org/T203488) [17:45:14] (03CR) 10Andrew Bogott: [C: 032] wmcs: add role defines for things in the eqiad1 deploy [puppet] - 10https://gerrit.wikimedia.org/r/457954 (https://phabricator.wikimedia.org/T203488) (owner: 10Andrew Bogott) [17:47:09] jouncebot: next [17:47:09] In 1 hour(s) and 12 minute(s): MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180904T1900) [17:47:44] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457956 [17:48:12] 10Puppet, 10Cloud-VPS, 
10Patch-For-Review, 10cloud-services-team (Kanban): wmcs-roots group not granted access to new eqiad1 region bare metal servers - https://phabricator.wikimedia.org/T203488 (10bd808) Do we need to add documentation somewhere for turning up new services/regions about this additional co... [17:49:25] (03PS2) 10Dzahn: admins: add ccicalese to analytics-privatedata-admins [puppet] - 10https://gerrit.wikimedia.org/r/456763 (https://phabricator.wikimedia.org/T203182) [17:49:42] (03CR) 10Dzahn: [C: 032] "approval on https://phabricator.wikimedia.org/T203182#4553011" [puppet] - 10https://gerrit.wikimedia.org/r/456763 (https://phabricator.wikimedia.org/T203182) (owner: 10Dzahn) [17:50:05] 10Puppet, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): wmcs-roots group not granted access to new eqiad1 region bare metal servers - https://phabricator.wikimedia.org/T203488 (10bd808) >>! In T203488#4557073, @gerritbot wrote: > Change 457954 **merged** by Andrew Bogott: > [operations/p... [17:52:19] (03PS6) 10Andrew Bogott: ircecho: Add support for authenticating with SASL [puppet] - 10https://gerrit.wikimedia.org/r/455277 (https://phabricator.wikimedia.org/T48254) (owner: 10Alex Monk) [17:54:08] (03CR) 10Paladox: [C: 04-1] "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/434605 (owner: 10Paladox) [17:54:22] (03CR) 10Paladox: [C: 031] "Needs to be fixed gerrit side to support this move." 
[puppet] - 10https://gerrit.wikimedia.org/r/423794 (owner: 10Chad) [17:54:37] (03CR) 10Jdlrobson: [C: 031] Remove obsolete $wgPopupsBetaFeature, Part I: CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450906 (owner: 10Prtksxna) [17:58:30] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457956 (owner: 10Marostegui) [17:59:51] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457956 (owner: 10Marostegui) [18:02:06] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1078 (duration: 00m 57s) [18:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:52] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:07:39] that's me, fix coming up [18:08:40] (03PS1) 10Dzahn: admins: fix cicalese's user name in analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/457959 (https://phabricator.wikimedia.org/T203182) [18:10:12] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:11:08] (03CR) 10Dzahn: [C: 032] admins: fix cicalese's user name in analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/457959 (https://phabricator.wikimedia.org/T203182) (owner: 10Dzahn) [18:11:40] (03CR) 10Paladox: [C: 031] "We can symlink the logs folder instead to /var/log/gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/423794 (owner: 10Chad) [18:11:52] (03PS5) 10Paladox: Gerrit: Move all logging to /var/log/gerrit [puppet] - 10https://gerrit.wikimedia.org/r/423794 (owner: 10Chad) [18:12:02] (03PS6) 10Paladox: Gerrit: Move all logging to /var/log/gerrit [puppet] - 
10https://gerrit.wikimedia.org/r/423794 (owner: 10Chad) [18:13:41] PROBLEM - puppet last run on notebook1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:14:22] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:14:34] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457956 (owner: 10Marostegui) [18:14:53] (03CR) 10Krinkle: "@Jon Could your team schedule this for SWAT?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450906 (owner: 10Prtksxna) [18:15:15] (03PS7) 10Paladox: Gerrit: Move all logging to /var/log/gerrit [puppet] - 10https://gerrit.wikimedia.org/r/423794 (owner: 10Chad) [18:16:31] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:22:02] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:23:58] 10Operations, 10netops: cr2-eqdfw (MX204) vhclient log noise - https://phabricator.wikimedia.org/T203261 (10ayounsi) a:03ayounsi Opened JTAC case 2018-0904-0650. 
[18:26:41] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:27:34] (03CR) 10Paladox: "Before merging this change you copy the logs from "/var/lib/gerrit2/review_site/logs" to "/var/log/gerrit" after then you remove "/var/lib" [puppet] - 10https://gerrit.wikimedia.org/r/423794 (owner: 10Chad) [18:38:37] !log change internal NAT for 208.80.155.12 - T203475 [18:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:14] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to EventLogging in Hive (analytics-privatedata-users) for Cicalese - https://phabricator.wikimedia.org/T203182 (10Dzahn) @CCicalese_WMF on stat1004/stat1005 your user has been added to the additional group. Let us know if it works fo... [18:44:05] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:45:56] (03PS1) 10Jgreen: remove thulium from monitoring, adjust external IP for frpig1001 [puppet] - 10https://gerrit.wikimedia.org/r/457965 [18:46:53] (03CR) 10Jgreen: [C: 032] remove thulium from monitoring, adjust external IP for frpig1001 [puppet] - 10https://gerrit.wikimedia.org/r/457965 (owner: 10Jgreen) [18:47:51] (03PS3) 10Bstorm: wiki replicas: moving compatibility views to $table_compat [puppet] - 10https://gerrit.wikimedia.org/r/447654 (https://phabricator.wikimedia.org/T174047) [18:49:03] (03CR) 10Bstorm: wiki replicas: moving compatibility views to $table_compat (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/447654 (https://phabricator.wikimedia.org/T174047) (owner: 10Bstorm) [18:50:11] (03CR) 10Gehel: dnsdisc: add methods for checking if a datacenter can be depooled (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/457951 (owner: 10Giuseppe Lavagetto) [18:52:57] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to EventLogging in Hive 
(analytics-privatedata-users) for Cicalese - https://phabricator.wikimedia.org/T203182 (10CCicalese_WMF) Thank you! I was able to log in to both successfully. [18:54:26] 10Operations, 10Scap, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes on HHVM (rather than ~5 on PHP 5) - https://phabricator.wikimedia.org/T191921 (10hashar) `/proc/self/status` has informations about the process notably: > * Cpus_allowed: Mask of CPUs on which t... [18:54:27] (03PS4) 10Bstorm: wiki replicas: moving compatibility views to $table_compat [puppet] - 10https://gerrit.wikimedia.org/r/447654 (https://phabricator.wikimedia.org/T174047) [19:00:04] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180904T1900) [19:04:22] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to EventLogging in Hive (analytics-privatedata-users) for Cicalese - https://phabricator.wikimedia.org/T203182 (10Dzahn) 05Open>03Resolved Alright, thanks for confirming. I'll close this as resolved then. 
[19:06:51] (03PS1) 10Andrew Bogott: Add key and root access for Giovanni Tirloni [puppet] - 10https://gerrit.wikimedia.org/r/457972 [19:12:42] 10Operations, 10SRE-Access-Requests: Requesting access to Root for Giovanni Tirloni - https://phabricator.wikimedia.org/T203494 (10Andrew) [19:13:12] (03PS2) 10Andrew Bogott: Add key and root access for Giovanni Tirloni [puppet] - 10https://gerrit.wikimedia.org/r/457972 (https://phabricator.wikimedia.org/T203494) [19:13:51] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Root for Giovanni Tirloni - https://phabricator.wikimedia.org/T203494 (10Andrew) [19:14:30] 10Operations, 10cloud-services-team: Onboard gtirloni to WMF - https://phabricator.wikimedia.org/T203489 (10GTirloni) [19:15:59] 10Operations, 10cloud-services-team: Onboard gtirloni to WMF - https://phabricator.wikimedia.org/T203489 (10Andrew) [19:16:04] (03CR) 10Gehel: [C: 031] "LGTM, trivial enough" [cookbooks] - 10https://gerrit.wikimedia.org/r/457943 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [19:18:43] 10Operations, 10cloud-services-team: Onboard gtirloni to WMF - https://phabricator.wikimedia.org/T203489 (10Andrew) [19:20:33] 10Operations, 10cloud-services-team: Onboard gtirloni to WMF - https://phabricator.wikimedia.org/T203489 (10GTirloni) [19:21:14] 10Operations, 10cloud-services-team: Onboard gtirloni to WMF - https://phabricator.wikimedia.org/T203489 (10Andrew) [19:24:12] (03PS7) 10Andrew Bogott: ircecho: Add support for authenticating with SASL [puppet] - 10https://gerrit.wikimedia.org/r/455277 (https://phabricator.wikimedia.org/T48254) (owner: 10Alex Monk) [19:25:39] (03CR) 10Andrew Bogott: [C: 032] ircecho: Add support for authenticating with SASL [puppet] - 10https://gerrit.wikimedia.org/r/455277 (https://phabricator.wikimedia.org/T48254) (owner: 10Alex Monk) [19:26:00] !log rolling restart of elasticsearch / cirrus / eqiad for various updates and data directory migration completed - T198351 [19:26:05] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:06] T198351: Refactor puppet to support multiple elasticsearch instances on same node - https://phabricator.wikimedia.org/T198351 [19:28:44] (03PS27) 10Alex Monk: [WIP] Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [19:30:31] 10Operations, 10cloud-services-team: Onboard gtirloni to WMF - https://phabricator.wikimedia.org/T203489 (10Andrew) [19:32:06] (03PS1) 10Alex Monk: ircecho: Fix spaces between --ident_passwd_file and --infile [puppet] - 10https://gerrit.wikimedia.org/r/457975 [19:33:31] (03CR) 10Andrew Bogott: [C: 032] ircecho: Fix spaces between --ident_passwd_file and --infile [puppet] - 10https://gerrit.wikimedia.org/r/457975 (owner: 10Alex Monk) [19:34:29] (03CR) 10Alex Monk: "This is a follow-up to I1e84b5b1 which was close but had a little space problem e.g.:" [puppet] - 10https://gerrit.wikimedia.org/r/457975 (owner: 10Alex Monk) [19:36:19] 10Operations, 10netops: cr2-eqdfw (MX204) vhclient log noise - https://phabricator.wikimedia.org/T203261 (10ayounsi) Confirmed by JTAC, it's PR1315128. 
[19:45:53] PROBLEM - MariaDB Slave Lag: s8 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 579.98 seconds
[19:58:55] (CR) Volans: [C: 2] sre.switchdc.mediawiki: add validation of CLI args [cookbooks] - https://gerrit.wikimedia.org/r/457943 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[19:59:52] (Merged) jenkins-bot: sre.switchdc.mediawiki: add validation of CLI args [cookbooks] - https://gerrit.wikimedia.org/r/457943 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[20:03:10] (CR) Volans: ircecho: Add support for authenticating with SASL (1 comment) [puppet] - https://gerrit.wikimedia.org/r/455277 (https://phabricator.wikimedia.org/T48254) (owner: Alex Monk)
[20:04:29] (CR) Krinkle: [C: 1] Update npm to 6.4.0 [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/453666 (https://phabricator.wikimedia.org/T169451) (owner: Legoktm)
[20:11:33] (PS28) Alex Monk: [WIP] Central certificates service [puppet] - https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962)
[20:13:42] (PS1) Alex Monk: ircecho: Move ib3_auth to better place [puppet] - https://gerrit.wikimedia.org/r/457991
[20:15:44] * Krinkle staging on mwdebug1002/deployment for https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/FlaggedRevs/+/457992/
[20:16:03] (PS1) Ottomata: [WIP][POC] Presto Puppetization [puppet] - https://gerrit.wikimedia.org/r/457993
[20:16:45] (PS3) EBernhardson: Deploy msearch daemon to cirrus servers [puppet] - https://gerrit.wikimedia.org/r/454722 (https://phabricator.wikimedia.org/T200740)
[20:16:49] (CR) jerkins-bot: [V: -1] [WIP][POC] Presto Puppetization [puppet] - https://gerrit.wikimedia.org/r/457993 (owner: Ottomata)
[20:17:12] (CR) Alex Monk: ircecho: Add support for authenticating with SASL (1 comment) [puppet] - https://gerrit.wikimedia.org/r/455277 (https://phabricator.wikimedia.org/T48254) (owner: Alex Monk)
[20:18:09] (PS2) Ottomata: [WIP][POC] Presto Puppetization [puppet] - https://gerrit.wikimedia.org/r/457993
[20:18:55] (CR) jerkins-bot: [V: -1] [WIP][POC] Presto Puppetization [puppet] - https://gerrit.wikimedia.org/r/457993 (owner: Ottomata)
[20:22:15] (PS4) EBernhardson: Deploy msearch daemon to cirrus servers [puppet] - https://gerrit.wikimedia.org/r/454722 (https://phabricator.wikimedia.org/T200740)
[20:30:21] (Abandoned) Dzahn: conftool/client: rm 'obsolete distribution check in ubuntu <= trusty' [puppet] - https://gerrit.wikimedia.org/r/454592 (owner: Dzahn)
[20:30:39] (Abandoned) Dzahn: convert check_graphite to python3 [puppet] - https://gerrit.wikimedia.org/r/441209 (owner: Dzahn)
[20:30:57] (Abandoned) Dzahn: scap/deployment_server: replace trebuchet with mwdeploy user (WIP) [puppet] - https://gerrit.wikimedia.org/r/433516 (owner: Dzahn)
[20:31:25] Operations, cloud-services-team: Onboard gtirloni to WMF - https://phabricator.wikimedia.org/T203489 (Andrew)
[20:32:04] (CR) Andrew Bogott: [C: 2] ircecho: Move ib3_auth to better place [puppet] - https://gerrit.wikimedia.org/r/457991 (owner: Alex Monk)
[20:36:03] (Abandoned) Dzahn: beta: add fixcopyright.wm to Apache sites/wikimedia.conf [puppet] - https://gerrit.wikimedia.org/r/456192 (https://phabricator.wikimedia.org/T202819) (owner: Dzahn)
[20:40:21] (PS1) Alex Monk: ircecho: Tidy up file absent [puppet] - https://gerrit.wikimedia.org/r/458055
[20:44:02] Operations, cloud-services-team: Onboard gtirloni to WMF - https://phabricator.wikimedia.org/T203489 (Andrew)
[20:47:04] (PS1) Ladsgroup: Enable poolcounter for orespoolcounter[12]00[12] [puppet] - https://gerrit.wikimedia.org/r/458056 (https://phabricator.wikimedia.org/T201824)
[20:48:41] * Krinkle is still staging on mwdebug1002/deployment (Jenkins pending...)
[20:52:59] (CR) GTirloni: [C: 2] Add key and root access for Giovanni Tirloni [puppet] - https://gerrit.wikimedia.org/r/457972 (https://phabricator.wikimedia.org/T203494) (owner: Andrew Bogott)
[20:54:11] jouncebot: next
[20:54:12] In 2 hour(s) and 5 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180904T2300)
[20:57:11] Operations, Scap, Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes on HHVM (rather than ~5 on PHP 5) - https://phabricator.wikimedia.org/T191921 (hashar) Using `PHP=php7.0 mwscript` it is not affected. So I guess HHVM ends up invoking `sched_setaffinity` whene...
[21:00:03] legoktm: I plan to be done by then, lol
[21:00:07] I'm not waiting that long for Jenkins
[21:00:11] Got another npm checksum fail
[21:00:16] At the very last step for webdriver
[21:00:29] forcing through for now, none of this changed the UI
[21:00:34] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.19/resources/src/startup/: mw.loader improvements and fixes (duration: 00m 58s)
[21:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:01:31] Krinkle: ah ok, I want to deploy some beta stuff after you're done
[21:01:40] np, 2 syncs left
[21:01:50] I'm still writing patches :p
[21:02:42] (CR) Andrew Bogott: [C: 2] ircecho: Tidy up file absent [puppet] - https://gerrit.wikimedia.org/r/458055 (owner: Alex Monk)
[21:03:02] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.19/extensions/FlaggedRevs/frontend/FlaggablePageView.php: I6dce0c59a3da9a (duration: 00m 58s)
[21:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:05:29] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.20/resources/src/: resourceloader: startup and mediawiki.base improvements and fixes (duration: 00m 57s)
[21:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:06:09] legoktm: done
[21:07:25] (CR) Cwhite: [C: 1] icinga: move Hiera calls to profile parameters [puppet] - https://gerrit.wikimedia.org/r/455248 (owner: Dzahn)
[21:08:05] (CR) Cwhite: [C: 1] icinga: rename role::icinga to profile::icinga [puppet] - https://gerrit.wikimedia.org/r/455247 (owner: Dzahn)
[21:08:20] (CR) Cwhite: [C: 1] icinga: move Hiera data from hosts to role [puppet] - https://gerrit.wikimedia.org/r/455262 (owner: Dzahn)
[21:09:35] (PS2) Legoktm: Enable EUCopyrightCampaign extensions/skin on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/456780 (https://phabricator.wikimedia.org/T203299)
[21:09:45] (CR) Cwhite: [C: 1] icinga: add Hieradata for icinga1001, set to passive/disabled [puppet] - https://gerrit.wikimedia.org/r/455264 (owner: Dzahn)
[21:09:47] (PS3) Legoktm: Enable EUCopyrightCampaign extensions/skin on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/456780 (https://phabricator.wikimedia.org/T203299)
[21:12:57] (CR) Legoktm: [C: 2] Enable EUCopyrightCampaign extensions/skin on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/456780 (https://phabricator.wikimedia.org/T203299) (owner: Legoktm)
[21:14:20] (Merged) jenkins-bot: Enable EUCopyrightCampaign extensions/skin on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/456780 (https://phabricator.wikimedia.org/T203299) (owner: Legoktm)
[21:16:16] !log legoktm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable EUCopyrightCampaign extensions/skin on beta cluster (1/3) (duration: 00m 57s)
[21:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:17:58] (CR) Cwhite: [C: 1] "Although this might not be strictly necessary as $is_passive sets enable_notifications = 0, it might be worthwhile to ensure stopped until" [puppet] - https://gerrit.wikimedia.org/r/455238 (https://phabricator.wikimedia.org/T201344) (owner: Dzahn)
[21:18:04] !log legoktm@deploy1001 Synchronized wmf-config/CommonSettings.php: Enable EUCopyrightCampaign extensions/skin on beta cluster (2/3) (duration: 00m 57s)
[21:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:19:36] !log legoktm@deploy1001 Synchronized wmf-config/: Enable EUCopyrightCampaign extensions/skin on beta cluster (3/3) (duration: 00m 58s)
[21:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:19:49] is bblack here? holaaa
[21:20:30] * legoktm now waits for jenkins
[21:22:28] (PS1) Herron: mx: strengthen exim tls_require_ciphers [puppet] - https://gerrit.wikimedia.org/r/458061 (https://phabricator.wikimedia.org/T203260)
[21:24:45] Operations, Mail, Patch-For-Review, User-herron: Outdated TLS config for MXes - https://phabricator.wikimedia.org/T203260 (herron) Upgrading mx1001 to stretch first makes sense to me as well. Comparing mx2001 (stretch) to mx1001 (jessie), upgrading to stretch itself addressed RC4 and TLS_FALLBAC...
[21:25:19] (PS1) Cwhite: nagios_common: use libmonitoring-plugin-perl on stretch as libnagios-plugin-perl is deprecated [puppet] - https://gerrit.wikimedia.org/r/458062 (https://phabricator.wikimedia.org/T201344)
[21:26:00] (CR) jerkins-bot: [V: -1] nagios_common: use libmonitoring-plugin-perl on stretch as libnagios-plugin-perl is deprecated [puppet] - https://gerrit.wikimedia.org/r/458062 (https://phabricator.wikimedia.org/T201344) (owner: Cwhite)
[21:26:01] (CR) jenkins-bot: Enable EUCopyrightCampaign extensions/skin on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/456780 (https://phabricator.wikimedia.org/T203299) (owner: Legoktm)
[21:28:19] !log deleted archived file
[21:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:30:52] (PS5) Dzahn: icinga: make service_ensure status configurable in Hiera [puppet] - https://gerrit.wikimedia.org/r/455238 (https://phabricator.wikimedia.org/T201344)
[21:34:29] (CR) Greg Grossmeier: [C: 1] "We need to merge this and get it deployed." [puppet] - https://gerrit.wikimedia.org/r/439483 (https://phabricator.wikimedia.org/T196835) (owner: Paladox)
[21:35:39] (CR) Cwhite: "recheck" [puppet] - https://gerrit.wikimedia.org/r/458062 (https://phabricator.wikimedia.org/T201344) (owner: Cwhite)
[21:36:14] (CR) jerkins-bot: [V: -1] nagios_common: use libmonitoring-plugin-perl on stretch as libnagios-plugin-perl is deprecated [puppet] - https://gerrit.wikimedia.org/r/458062 (https://phabricator.wikimedia.org/T201344) (owner: Cwhite)
[21:37:54] (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/12349/einsteinium.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/455238 (https://phabricator.wikimedia.org/T201344) (owner: Dzahn)
[21:38:13] (CR) Subramanya Sastry: "I asked about ParserMigration at https://en.wikipedia.org/wiki/Wikipedia_talk:Linter#Question_about_ParserMigration_extension .. since if " [mediawiki-config] - https://gerrit.wikimedia.org/r/443645 (https://phabricator.wikimedia.org/T175706) (owner: C. Scott Ananian)
[21:40:33] (PS6) Paladox: Gerrit: Add CoC and privacy policy to footer [puppet] - https://gerrit.wikimedia.org/r/439483 (https://phabricator.wikimedia.org/T196835)
[21:41:16] (Abandoned) Paladox: ircecho: Support auth over irc [puppet] - https://gerrit.wikimedia.org/r/405594 (owner: Paladox)
[21:43:10] (CR) Paladox: [C: 1] "Noop (after merged and deployed we can merge the puppet change)" [software/gerrit] (deploy/wmf/stable-2.15) - https://gerrit.wikimedia.org/r/439503 (https://phabricator.wikimedia.org/T196835) (owner: Paladox)
[21:47:29] (PS5) Paladox: Link to gerrit-theme.html in scap repo [puppet] - https://gerrit.wikimedia.org/r/439504 (https://phabricator.wikimedia.org/T196835)
[21:55:05] (PS30) EBernhardson: Convert elasticsearch to systemd unit [puppet] - https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351)
[21:55:07] (PS29) EBernhardson: convert role::logstash::elasticsearch to profiles [puppet] - https://gerrit.wikimedia.org/r/441894 (https://phabricator.wikimedia.org/T198351)
[21:55:09] (PS31) EBernhardson: prometheus/elasticsearch support multiple exporters per host [puppet] - https://gerrit.wikimedia.org/r/441321 (https://phabricator.wikimedia.org/T198351)
[21:55:11] (PS38) EBernhardson: Split instance define out of elasticsearch class [puppet] - https://gerrit.wikimedia.org/r/441338 (https://phabricator.wikimedia.org/T198351)
[21:55:13] (PS64) EBernhardson: Allow multiple elasticsearch instances per host [puppet] - https://gerrit.wikimedia.org/r/440049 (https://phabricator.wikimedia.org/T198351)
[21:56:55] (CR) jerkins-bot: [V: -1] Split instance define out of elasticsearch class [puppet] - https://gerrit.wikimedia.org/r/441338 (https://phabricator.wikimedia.org/T198351) (owner: EBernhardson)
[21:56:57] (PS5) Dzahn: icinga: rename role::icinga to profile::icinga [puppet] - https://gerrit.wikimedia.org/r/455247
[21:58:03] (CR) jerkins-bot: [V: -1] icinga: rename role::icinga to profile::icinga [puppet] - https://gerrit.wikimedia.org/r/455247 (owner: Dzahn)
[22:04:49] (PS39) EBernhardson: Split instance define out of elasticsearch class [puppet] - https://gerrit.wikimedia.org/r/441338 (https://phabricator.wikimedia.org/T198351)
[22:04:51] (PS65) EBernhardson: Allow multiple elasticsearch instances per host [puppet] - https://gerrit.wikimedia.org/r/440049 (https://phabricator.wikimedia.org/T198351)
[22:14:58] (PS1) Jgreen: prepare to decom thulium.frack.eqiad.wmnet, clean up related deprecated entries [dns] - https://gerrit.wikimedia.org/r/458067
[22:16:41] (PS2) Jgreen: prepare to decom thulium.frack.eqiad.wmnet, clean up related deprecated entries [dns] - https://gerrit.wikimedia.org/r/458067
[22:18:08] (CR) Jgreen: [C: 2] prepare to decom thulium.frack.eqiad.wmnet, clean up related deprecated entries [dns] - https://gerrit.wikimedia.org/r/458067 (owner: Jgreen)
[22:19:30] !log authdns-update to deploy DNS changes removing thulium
[22:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:19:40] !log add BGP sessions to AS8220 in eqiad + eqsin
[22:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:20:30] (PS6) Dzahn: icinga: rename role::icinga to profile::icinga [puppet] - https://gerrit.wikimedia.org/r/455247
[22:22:21] !log add BGP sessions to AS64096 in eqord
[22:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:27:04] (PS1) Legoktm: Enable EUCopyrightCampaign extensions and SkinPerPage for fixcopyrightwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/458070 (https://phabricator.wikimedia.org/T203296)
[22:27:06] (PS1) Legoktm: Enable $wgULSLanguageDetection for fixcopyrightwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/458071 (https://phabricator.wikimedia.org/T203179)
[22:27:33] (CR) Legoktm: [C: -2] "Still pending security review" [mediawiki-config] - https://gerrit.wikimedia.org/r/458070 (https://phabricator.wikimedia.org/T203296) (owner: Legoktm)
[22:29:15] (CR) Dzahn: [C: 2] "no change except the resource name" [puppet] - https://gerrit.wikimedia.org/r/455247 (owner: Dzahn)
[22:31:04] Operations, ops-eqiad, fundraising-tech-ops: decommission thulium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T203520 (Jgreen)
[22:34:45] Operations, Scap, Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes on HHVM (rather than ~5 on PHP 5) - https://phabricator.wikimedia.org/T191921 (Krinkle) @hashar Nice research! For web servers, HHVM has 1 worker process per CPU (this happens before affinity...
[22:39:18] Jeff_Green: smokeping alerts about "eqiad.frack.thulium"
[22:39:49] garg
[22:39:51] looking
[22:41:26] XioNoX: maybe I should swap it out for the NAT address
[22:42:17] I'm not sure what makes sense, we have smokeping configured to use thulium.frack.eqiad.wmnet vs e.g. frbast-eqiad.wikimedia.org
[22:42:22] Jeff_Green: iirc, we have a NAT and a "non-nat" target
[22:42:34] to monitor both sides
[22:42:46] just an arbitrary target somewhere on the internal network?
[22:43:10] https://smokeping.wikimedia.org/smokeping.cgi?target=eqiad.frack
[22:43:37] ok, i'll switch it to frpig1001 then
[22:43:41] 1 is the router, 1 is a NAT IP (frbast), 1 is a real host (thulium)
[22:44:52] (PS1) Jgreen: switch from thulium to frpig1001 for an internal smokeping target [puppet] - https://gerrit.wikimedia.org/r/458072
[22:45:41] (CR) Jgreen: [C: 2] switch from thulium to frpig1001 for an internal smokeping target [puppet] - https://gerrit.wikimedia.org/r/458072 (owner: Jgreen)
[22:48:31] jouncebot: next
[22:48:31] In 0 hour(s) and 11 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180904T2300)
[22:49:05] (PS31) EBernhardson: Convert elasticsearch to systemd unit [puppet] - https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351)
[22:49:07] (PS30) EBernhardson: convert role::logstash::elasticsearch to profiles [puppet] - https://gerrit.wikimedia.org/r/441894 (https://phabricator.wikimedia.org/T198351)
[22:49:09] (PS32) EBernhardson: prometheus/elasticsearch support multiple exporters per host [puppet] - https://gerrit.wikimedia.org/r/441321 (https://phabricator.wikimedia.org/T198351)
[22:49:11] (PS40) EBernhardson: Split instance define out of elasticsearch class [puppet] - https://gerrit.wikimedia.org/r/441338 (https://phabricator.wikimedia.org/T198351)
[22:49:13] (PS66) EBernhardson: Allow multiple elasticsearch instances per host [puppet] - https://gerrit.wikimedia.org/r/440049 (https://phabricator.wikimedia.org/T198351)
[22:51:52] Operations, Traffic, Patch-For-Review: Sort out HTTP caching issues for fixcopyright wiki - https://phabricator.wikimedia.org/T203179 (Legoktm) [15:43:27] bblack: hmm, I'm seeing weird stuff re: A-L & fixcopyrightwiki [15:43:33] km@km-pt ~> curl -I "https://fixcopyright.wikimedia....
[23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the Evening SWAT (Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180904T2300).
[23:00:04] hauskatze and Niharika: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:10] YES
[23:00:14] I'm here
[23:01:02] (not a deployer, requesting that someone do it for me ktnx)
[23:03:40] I'll do it when I get back to my desk in a few minutes
[23:07:14] (PS3) Dzahn: icinga: move Hiera calls to profile parameters [puppet] - https://gerrit.wikimedia.org/r/455248
[23:07:16] (PS2) Dzahn: icinga: add Hieradata for icinga1001, set to passive/disabled [puppet] - https://gerrit.wikimedia.org/r/455264
[23:07:59] (CR) jerkins-bot: [V: -1] icinga: move Hiera calls to profile parameters [puppet] - https://gerrit.wikimedia.org/r/455248 (owner: Dzahn)
[23:08:34] thanks RoanKattouw
[23:10:58] Thanks Roan.
[23:12:24] (PS9) Catrope: Modify gender namespaces for pl.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/454213 (https://phabricator.wikimedia.org/T202347) (owner: MarcoAurelio)
[23:12:30] (CR) Catrope: [C: 2] Modify gender namespaces for pl.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/454213 (https://phabricator.wikimedia.org/T202347) (owner: MarcoAurelio)
[23:14:02] (Merged) jenkins-bot: Modify gender namespaces for pl.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/454213 (https://phabricator.wikimedia.org/T202347) (owner: MarcoAurelio)
[23:15:09] (PS41) EBernhardson: Split instance define out of elasticsearch class [puppet] - https://gerrit.wikimedia.org/r/441338 (https://phabricator.wikimedia.org/T198351)
[23:15:11] (PS67) EBernhardson: Allow multiple elasticsearch instances per host [puppet] - https://gerrit.wikimedia.org/r/440049 (https://phabricator.wikimedia.org/T198351)
[23:15:45] Hauskatze: Your patch is on mwdebug1002, please test
[23:15:50] on it
[23:16:01] (CR) Catrope: [C: 2] Deploy TemplateWizard to test and test2 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/457950 (https://phabricator.wikimedia.org/T202545) (owner: Niharika29)
[23:16:58] (PS3) Dzahn: icinga: add Hieradata for icinga1001, set to passive/disabled [puppet] - https://gerrit.wikimedia.org/r/455264
[23:17:24] RoanKattouw: looks good to me, can you run namespaceDupes.php (dry run) before deploying and see if there's any conflicts?
[23:17:32] before
[23:17:52] otherwise we can proceed
[23:18:18] 281 links to fix, 281 were resolvable.
[23:18:36] can I have the output somewhere, please?
[23:19:03] It's all pagelinks rows for WP: -> Wikipedia:
[23:19:17] Most likely that feature was added to namespaceDupes after it was last run on this wiki
[23:19:28] Wikipedia? it's Wiktionary, right?
[23:19:31] Pretty sure it didn't have this feature back in ~2010
[23:19:34] Oh, oops lol
[23:19:39] I ran it on plwiki, good catch
[23:19:55] I'll file a task for plwiki lol
[23:20:00] 0 pages to fix, 0 were resolvable.
[23:20:01] good catch as well
[23:20:04] yay
[23:20:15] 15 links to fix, 15 were resolvable. (All of them WS: stuff, just like on plwiki)
[23:20:28] so we can deploy this and then re-run with --fix
[23:20:35] I can also just run this with --fix on both wikis
[23:20:36] Yeah
[23:20:53] if you'd like, it won't harm /me guesses
[23:21:33] Yeah I don't think it will
[23:21:45] (CR) jenkins-bot: Modify gender namespaces for pl.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/454213 (https://phabricator.wikimedia.org/T202347) (owner: MarcoAurelio)
[23:21:53] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Modify gender namespaces on plwiktionary (T202347) (duration: 00m 57s)
[23:21:59] (PS4) Dzahn: icinga: move Hiera calls to profile parameters [puppet] - https://gerrit.wikimedia.org/r/455248
[23:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:22:00] T202347: Change 'User' and 'User talk' namespace name on Polish Wiktionary (plwiktionary) - https://phabricator.wikimedia.org/T202347
[23:22:51] !log Ran namespaceDupes.php --fix on plwiktionary and plwiki
[23:22:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:23:30] (PS2) Catrope: Deploy TemplateWizard to test and test2 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/457950 (https://phabricator.wikimedia.org/T202545) (owner: Niharika29)
[23:23:32] (PS68) EBernhardson: Allow multiple elasticsearch instances per host [puppet] - https://gerrit.wikimedia.org/r/440049 (https://phabricator.wikimedia.org/T198351)
[23:23:49] (CR) Catrope: [C: 2] Deploy TemplateWizard to test and test2 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/457950 (https://phabricator.wikimedia.org/T202545) (owner: Niharika29)
[23:26:26] (Merged) jenkins-bot: Deploy TemplateWizard to test and test2 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/457950 (https://phabricator.wikimedia.org/T202545) (owner: Niharika29)
[23:26:33] Everything seems to be working. I've pinged matmarex on the task so he can check as well but so far, everything seems to be fine -- thanks!!!
[23:27:29] (PS1) BBlack: fixcopyright: avoid duplicate Vary:A-L [puppet] - https://gerrit.wikimedia.org/r/458077 (https://phabricator.wikimedia.org/T203179)
[23:28:41] Niharika: Yours is now on mwdebug1002, please test
[23:29:22] RoanKattouw: Looking good.
[23:29:27] OK deploying
[23:30:49] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable TemplateWizard on testwiki and test2wiki (T202545) (duration: 00m 58s)
[23:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:30:55] T202545: Deploy TemplateWizard - https://phabricator.wikimedia.org/T202545
[23:31:34] (CR) BBlack: [C: 2] fixcopyright: avoid duplicate Vary:A-L [puppet] - https://gerrit.wikimedia.org/r/458077 (https://phabricator.wikimedia.org/T203179) (owner: BBlack)
[23:31:40] Thanks RoanKattouw!
[23:34:09] !log catrope@deploy1001 Synchronized php-1.32.0-wmf.20/extensions/PageTriage/maintenance/: Add new maintenance script to fix deleted flag in PageTriage (T202582) (duration: 00m 58s)
[23:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:34:14] T202582: "Nominated for deletion" filter doesn't work - https://phabricator.wikimedia.org/T202582
[23:34:49] I think there should be a step by step for users to contact there eu representatives. On the fix the copyright skin
[23:34:56] https://meta.m.wikimedia.beta.wmflabs.org/wiki/Fix_copyright#take-action
[23:38:16] (CR) jenkins-bot: Deploy TemplateWizard to test and test2 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/457950 (https://phabricator.wikimedia.org/T202545) (owner: Niharika29)
[23:41:44] Operations, Traffic, Patch-For-Review: Sort out HTTP caching issues for fixcopyright wiki - https://phabricator.wikimedia.org/T203179 (BBlack) Should be fixed now, pending caches clearing out old results. I don't think it actually harms anything in the meantime.
[23:49:39] (CR) EBernhardson: "rebased on production. My current problem with this patch is I'm not sure how to deploy it. Currently when we deploy this patch:" [puppet] - https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351) (owner: EBernhardson)
[23:55:37] Operations, Traffic, Patch-For-Review: Sort out HTTP caching issues for fixcopyright wiki - https://phabricator.wikimedia.org/T203179 (Legoktm) Great :) And the special page TTL was fixed in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EUCopyrightCampaign/+/457097 to be 24h.