[00:00:09] (CR) Aklapper: [C: -1] "Wrong "to" date" [mediawiki-config] - https://gerrit.wikimedia.org/r/353921 (https://phabricator.wikimedia.org/T165421) (owner: Zppix)
[00:01:49] (PS1) Dzahn: wikistats: don't use /root/ dir for backup, use /usr/lib/ [puppet] - https://gerrit.wikimedia.org/r/353933
[00:20:23] RECOVERY - HP RAID on ms-be1039 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[00:24:37] (PS1) Dzahn: wikistats: create random db password once to bootstrap system [puppet] - https://gerrit.wikimedia.org/r/353936
[00:25:04] (CR) Dzahn: [C: 2] wikistats: puppetize deploy script [puppet] - https://gerrit.wikimedia.org/r/353932 (owner: Dzahn)
[00:27:42] (CR) Dzahn: [C: 2] wikistats: don't use /root/ dir for backup, use /usr/lib/ [puppet] - https://gerrit.wikimedia.org/r/353933 (owner: Dzahn)
[00:30:31] (PS2) Dzahn: wikistats: create random db password once to bootstrap system [puppet] - https://gerrit.wikimedia.org/r/353936
[00:32:15] (CR) Dzahn: [C: 2] wikistats: create random db password once to bootstrap system [puppet] - https://gerrit.wikimedia.org/r/353936 (owner: Dzahn)
[00:36:53] RECOVERY - HP RAID on ms-be1030 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[00:56:14] (PS1) BryanDavis: bd808's dotfiles [puppet] - https://gerrit.wikimedia.org/r/353937
[00:57:00] (CR) jerkins-bot: [V: -1] bd808's dotfiles [puppet] - https://gerrit.wikimedia.org/r/353937 (owner: BryanDavis)
[00:57:59] lol.
my dotfiles fail flake8
[00:58:14] * bd808 will think about fixing this some other time
[01:09:51] bd808: heh, you can add a ".pep8" file in there telling it to ignore all that
[01:10:23] RECOVERY - HP RAID on ms-be1031 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[01:16:13] (CR) Dzahn: [C: 2] Labs contint: Install php5-gmp and php7.0-gmp [puppet] - https://gerrit.wikimedia.org/r/353194 (https://phabricator.wikimedia.org/T164977) (owner: Paladox)
[01:16:18] (PS5) Dzahn: Labs contint: Install php5-gmp and php7.0-gmp [puppet] - https://gerrit.wikimedia.org/r/353194 (https://phabricator.wikimedia.org/T164977) (owner: Paladox)
[01:19:28] (PS1) Dzahn: wikistats: install libapach2-mod-php7.0 if on stretch [puppet] - https://gerrit.wikimedia.org/r/353938
[01:21:09] (PS2) Dzahn: wikistats: install libapache2-mod-php7.0 if on stretch [puppet] - https://gerrit.wikimedia.org/r/353938
[01:23:16] (CR) Dzahn: [C: 2] wikistats: install libapache2-mod-php7.0 if on stretch [puppet] - https://gerrit.wikimedia.org/r/353938 (owner: Dzahn)
[01:23:21] (PS3) Dzahn: wikistats: install libapache2-mod-php7.0 if on stretch [puppet] - https://gerrit.wikimedia.org/r/353938
[01:30:53] PROBLEM - nova-compute process on labvirt1012 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/nova-compute
[01:31:53] RECOVERY - nova-compute process on labvirt1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute
[01:50:18] (PS1) Aklapper: Explicitly list task IDs for disabled user accounts in weekly Phab email [puppet] - https://gerrit.wikimedia.org/r/353939
[01:53:00] (CR) Aklapper: [C: -1] "Ahem...checking DB columns, Spaces I cannot access are the reason."
[puppet] - https://gerrit.wikimedia.org/r/353939 (owner: Aklapper)
[01:58:12] (PS1) Dzahn: wikistats: add db schema, auto-create db, adjust backup dir [puppet] - https://gerrit.wikimedia.org/r/353940
[01:59:09] (CR) jerkins-bot: [V: -1] wikistats: add db schema, auto-create db, adjust backup dir [puppet] - https://gerrit.wikimedia.org/r/353940 (owner: Dzahn)
[02:01:19] (PS2) Dzahn: wikistats: add db schema, auto-create db, adjust backup dir [puppet] - https://gerrit.wikimedia.org/r/353940
[02:02:55] (CR) Dzahn: [C: 2] wikistats: add db schema, auto-create db, adjust backup dir [puppet] - https://gerrit.wikimedia.org/r/353940 (owner: Dzahn)
[02:03:00] (PS3) Dzahn: wikistats: add db schema, auto-create db, adjust backup dir [puppet] - https://gerrit.wikimedia.org/r/353940
[02:11:52] (PS1) Dzahn: wikistats: fix wrong file parameter, user -> owner [puppet] - https://gerrit.wikimedia.org/r/353941
[02:13:22] (PS2) Dzahn: wikistats: fix wrong file parameter, user -> owner [puppet] - https://gerrit.wikimedia.org/r/353941
[02:16:16] (CR) Dzahn: [C: 2] wikistats: fix wrong file parameter, user -> owner [puppet] - https://gerrit.wikimedia.org/r/353941 (owner: Dzahn)
[02:20:16] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.1) (duration: 07m 12s)
[02:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:26:19] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue May 16 02:26:19 UTC 2017 (duration 6m 3s)
[02:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:41:03] RECOVERY - HP RAID on ms-be1029 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[02:47:03] (PS1) Dzahn: wikistats: only run db init command once [puppet] - https://gerrit.wikimedia.org/r/353942
[02:47:19] (CR) jerkins-bot: [V: -1]
wikistats: only run db init command once [puppet] - https://gerrit.wikimedia.org/r/353942 (owner: Dzahn)
[02:47:56] (PS2) Dzahn: wikistats: only run db init command once [puppet] - https://gerrit.wikimedia.org/r/353942
[02:49:17] (CR) Dzahn: [C: 2] wikistats: only run db init command once [puppet] - https://gerrit.wikimedia.org/r/353942 (owner: Dzahn)
[02:49:48] (Abandoned) Aklapper: Explicitly list task IDs for disabled user accounts in weekly Phab email [puppet] - https://gerrit.wikimedia.org/r/353939 (owner: Aklapper)
[02:57:33] PROBLEM - HP RAID on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds.
[03:11:53] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=520.50 Read Requests/Sec=3051.50 Write Requests/Sec=11.80 KBytes Read/Sec=23420.40 KBytes_Written/Sec=110.40
[03:28:53] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=72.20 Read Requests/Sec=154.00 Write Requests/Sec=9.40 KBytes Read/Sec=2106.80 KBytes_Written/Sec=570.00
[04:05:53] RECOVERY - HP RAID on ms-be1028 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[04:08:53] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=7878.50 Read Requests/Sec=4149.90 Write Requests/Sec=7.00 KBytes Read/Sec=39910.80 KBytes_Written/Sec=86.80
[04:11:53] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=0.80 Read Requests/Sec=13.30 Write Requests/Sec=39.90 KBytes Read/Sec=167.20 KBytes_Written/Sec=376.40
[04:13:16] (PS1) Dzahn: wikistats: grant db permissions on first run (labs) [puppet] - https://gerrit.wikimedia.org/r/353944
[04:15:26] (CR) Dzahn: [C: -1] "wip" [puppet] - https://gerrit.wikimedia.org/r/353944 (owner: Dzahn)
[04:29:33] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with
command name python, args nova-fullstack
[04:32:33] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[05:05:43] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0; xe-3/2/3: down - Core: cr2-codfw:xe-5/0/1 (Zayo, OGYX/120003//ZYO) 36ms {#2909} [10Gbps wave]
[05:10:13] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 27 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[05:15:13] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 12 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[05:17:34] !log fyi, one of the links between codfw and eqiad is down for a scheduled Zayo maintenance. No outage, traffic routed around.
[05:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:23] (PS2) Amire80: Add namespace aliases for Hebrew Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/352889 (https://phabricator.wikimedia.org/T164858)
[05:19:29] (CR) jerkins-bot: [V: -1] Add namespace aliases for Hebrew Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/352889 (https://phabricator.wikimedia.org/T164858) (owner: Amire80)
[05:21:07] (PS3) Amire80: Add namespace aliases for Hebrew Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/352889 (https://phabricator.wikimedia.org/T164858)
[05:21:53] (CR) jerkins-bot: [V: -1] Add namespace aliases for Hebrew Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/352889 (https://phabricator.wikimedia.org/T164858) (owner: Amire80)
[05:29:33] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[05:32:33] PROBLEM - nova instance creation test on labnet1001 is CRITICAL:
PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[05:42:33] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[05:57:58] (PS1) Marostegui: db-codfw.php: Repool db2063, depool db2056 [mediawiki-config] - https://gerrit.wikimedia.org/r/353948 (https://phabricator.wikimedia.org/T162611)
[05:59:46] (CR) Marostegui: [C: 2] db-codfw.php: Repool db2063, depool db2056 [mediawiki-config] - https://gerrit.wikimedia.org/r/353948 (https://phabricator.wikimedia.org/T162611) (owner: Marostegui)
[06:01:00] (Merged) jenkins-bot: db-codfw.php: Repool db2063, depool db2056 [mediawiki-config] - https://gerrit.wikimedia.org/r/353948 (https://phabricator.wikimedia.org/T162611) (owner: Marostegui)
[06:01:31] (CR) jenkins-bot: db-codfw.php: Repool db2063, depool db2056 [mediawiki-config] - https://gerrit.wikimedia.org/r/353948 (https://phabricator.wikimedia.org/T162611) (owner: Marostegui)
[06:02:12] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2063, depool db2056 - T162611 (duration: 00m 40s)
[06:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:20] T162611: Unify revision table on s2 - https://phabricator.wikimedia.org/T162611
[06:02:47] !log Deploy alter table on s2 (revision table) db2056 - T162611
[06:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:30] !log Run pt-table-checksum on s7.viwiki - T163190
[06:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:38] T163190: Run pt-table-checksum on s7 - https://phabricator.wikimedia.org/T163190
[06:12:43] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0
[06:25:53] PROBLEM - HP RAID on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds.
[06:32:20] !log Disable replication codfw > eqiad on s3 https://phabricator.wikimedia.org/T147166 https://phabricator.wikimedia.org/T130067
[06:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:16] (PS3) Giuseppe Lavagetto: profile::cassandra: auto-generate fqdns for seeds [puppet] - https://gerrit.wikimedia.org/r/353049
[06:37:47] !log Stop replication at the same position on db1044 and db2018 - https://phabricator.wikimedia.org/T147166 https://phabricator.wikimedia.org/T130067
[06:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:05] (PS2) Muehlenhoff: package_builder: Install patchutils [puppet] - https://gerrit.wikimedia.org/r/353875
[06:55:38] (CR) Muehlenhoff: [C: 2] package_builder: Install patchutils [puppet] - https://gerrit.wikimedia.org/r/353875 (owner: Muehlenhoff)
[06:58:57] !log restarted hhvm on mw1165 (stuck in HPHP::Treadmill deadlock)
[06:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:04] Amir1: Respected human, time to deploy ores_classification clean up party (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170516T0700). Please do the needful.
[07:00:18] (PS1) Ayounsi: Add private AS# to LibreNMS [puppet] - https://gerrit.wikimedia.org/r/353951 (https://phabricator.wikimedia.org/T164911)
[07:00:33] RECOVERY - HHVM jobrunner on mw1165 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.078 second response time
[07:03:16] I start the cleaning right now
[07:06:18] !log start of cleaning up ores_classification table in enwiki last round (four hours) (T159753)
[07:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:26] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753
[07:10:36] !log upgrading mw1261-mw1265 to HHVM 3.18.2+wmf3
[07:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:27] Operations, Traffic, HTTPS: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#3266376 (Bawolff)
[07:15:09] Operations, Security: Go from "E" to "A+" on Securityheaders.io - https://phabricator.wikimedia.org/T165455#3266380 (Framawiki)
[07:23:19] Operations, Performance-Team, Thumbor, MW-1.30-release-notes (WMF-deploy-2017-05-09_(1.30.0-wmf.1)), Patch-For-Review: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#3266384 (Gilles) I've just realized that...
[07:26:13] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: Connect, AS1299/IPv6: Connect
[07:28:36] Operations, Security: Go from "E" to "A+" on Securityheaders.io - https://phabricator.wikimedia.org/T165455#3266289 (matmarex) We emit `X-Frame-Options: DENY` on all "sensitive" pages (pretty much everything that displays a form, e.g. action=edit or special pages). We probably can't do that on all pages...
[07:32:59] Operations, Security: Go from "E" to "A+" on Securityheaders.io - https://phabricator.wikimedia.org/T165455#3266406 (matmarex) We have the code to emit `Referrer-Policy`, and it's enabled on production wikis ('origin-when-cross-origin' by default, 'no-referrer' on private wikis), but it doesn't seem to a...
[07:39:01] !log upload prometheus-mysqld-exporter 0.10.0 to jessie-wikimedia - T161296
[07:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:09] T161296: Upgrade mysqld_exporter to 0.10.0 - https://phabricator.wikimedia.org/T161296
[07:42:19] (PS1) Muehlenhoff: Drop experimental apt repository from app servers [puppet] - https://gerrit.wikimedia.org/r/353952
[07:42:35] (PS2) Muehlenhoff: Drop experimental apt repository from app servers [puppet] - https://gerrit.wikimedia.org/r/353952
[07:44:32] (CR) Muehlenhoff: [C: 2] Drop experimental apt repository from app servers [puppet] - https://gerrit.wikimedia.org/r/353952 (owner: Muehlenhoff)
[07:57:33] PROBLEM - HP RAID on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds.
[07:59:41] Operations, Security: Go from "E" to "A+" on Securityheaders.io - https://phabricator.wikimedia.org/T165455#3266289 (Bawolff) > Content-Security-Policy This is probably the header that would improve our security the most. I've been working on this, but progress has been very slow, largely due to lack of...
[08:05:53] Operations, ops-eqiad, Analytics, Analytics-Cluster: rack/setup/install replacement to stat1002 (stat1004 or misc name?) - https://phabricator.wikimedia.org/T165368#3266433 (Ottomata) stat1004 please! :)
[08:06:15] Operations, ops-eqiad, Analytics, Analytics-Cluster: rack/setup/install replacement to stat1003 (stat1005 or misc name?) - https://phabricator.wikimedia.org/T165366#3266434 (Ottomata) stat1005 please!
:)
[08:07:01] Operations, ops-eqiad, Analytics, Analytics-Cluster: rack/setup/install replacement stat1006 (stat1003 replacement) - https://phabricator.wikimedia.org/T165366#3266436 (Ottomata)
[08:07:14] Operations, ops-eqiad, Analytics, Analytics-Cluster: rack/setup/install replacement to stat1005 (stat1002 replacement?) - https://phabricator.wikimedia.org/T165368#3266437 (Ottomata)
[08:07:22] Operations, ops-eqiad, Analytics, Analytics-Cluster: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#3264256 (Ottomata)
[08:07:51] Operations, ops-eqiad, Analytics, Analytics-Cluster: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#3264256 (Ottomata)
[08:08:38] Operations, ops-eqiad, Analytics, Analytics-Cluster: rack/setup/install replacement stat1006 (stat1003 replacement) - https://phabricator.wikimedia.org/T165366#3264224 (Ottomata)
[08:08:48] Operations, ops-eqiad, Analytics, Analytics-Cluster: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#3264256 (Ottomata)
[08:09:25] Operations, ops-eqiad, Analytics, Analytics-Cluster: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#3266446 (Ottomata) I modified the description for the host name, and I also moved the blurb about GPU from T165366 to this ticket, sin...
[08:09:37] Operations, ops-eqiad, User-fgiunchedi: Debug HP raid cache disabled errors on ms-be1019/20/21 - https://phabricator.wikimedia.org/T163777#3266448 (fgiunchedi)
[08:17:06] (PS1) Muehlenhoff: Add apt pinning for git on deployment servers [puppet] - https://gerrit.wikimedia.org/r/353954 (https://phabricator.wikimedia.org/T140927)
[08:22:37] !log installing git security updates on trusty (jessie already fixed)
[08:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:53] PROBLEM - HP RAID on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds.
[08:26:26] (CR) Filippo Giunchedi: [C: 1] Add apt pinning for git on deployment servers [puppet] - https://gerrit.wikimedia.org/r/353954 (https://phabricator.wikimedia.org/T140927) (owner: Muehlenhoff)
[08:27:04] (PS2) Ayounsi: Add private AS# to LibreNMS [puppet] - https://gerrit.wikimedia.org/r/353951 (https://phabricator.wikimedia.org/T164911)
[08:29:00] (PS4) Niharika29: Deploy and enable LoginNotify on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/353352
[08:30:13] (CR) Ayounsi: [C: 2] Add private AS# to LibreNMS [puppet] - https://gerrit.wikimedia.org/r/353951 (https://phabricator.wikimedia.org/T164911) (owner: Ayounsi)
[08:31:37] (CR) Chad: [C: 1] Add apt pinning for git on deployment servers [puppet] - https://gerrit.wikimedia.org/r/353954 (https://phabricator.wikimedia.org/T140927) (owner: Muehlenhoff)
[08:34:18] (PS2) Muehlenhoff: Add apt pinning for git on deployment servers [puppet] - https://gerrit.wikimedia.org/r/353954 (https://phabricator.wikimedia.org/T140927)
[08:35:55] (CR) Muehlenhoff: [C: 2] Add apt pinning for git on deployment servers [puppet] - https://gerrit.wikimedia.org/r/353954 (https://phabricator.wikimedia.org/T140927) (owner: Muehlenhoff)
[08:40:57] (PS1) Jcrespo: mariadb-auto_install: Remove db1056, add db1055 to reimage list [puppet] -
https://gerrit.wikimedia.org/r/353955
[08:45:11] !log upgrading git packages on tin/naos from local 2.11 backport to the version from jessie-backports
[08:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:42] (CR) Marostegui: [C: 1] mariadb-auto_install: Remove db1056, add db1055 to reimage list [puppet] - https://gerrit.wikimedia.org/r/353955 (owner: Jcrespo)
[08:46:23] (PS2) Jcrespo: mariadb-auto_install: Remove db1056, add db1055 to reimage list [puppet] - https://gerrit.wikimedia.org/r/353955
[08:49:32] (PS1) Jcrespo: mariadb: Depool db1055 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/353956
[08:55:23] (CR) Muehlenhoff: [C: 1] decommission mw2098 [puppet] - https://gerrit.wikimedia.org/r/353918 (owner: RobH)
[08:59:13] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 12, down: 0, shutdown: 0
[08:59:50] !log upgrading mw1170-mw1184 from HHVM 3.18.2+wmf2 to 3.18.2+wmf3
[08:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:22] (PS2) Volans: Do not auto-ucfirst when the query is a regex [software/cumin] - https://gerrit.wikimedia.org/r/345402 (https://phabricator.wikimedia.org/T161730)
[09:03:01] Operations, netops, Patch-For-Review: analytics hosts frequently tripping 'port utilization threshold' librenms alerts - https://phabricator.wikimedia.org/T133852#3266536 (fgiunchedi) @ayounsi thanks! I'm for excluding analytics hosts for this alarm on the basis that the alarm itself isn't actionable...
[09:03:58] (CR) Volans: [C: 2] "Ok, merging it as is for now, it can be improved later."
[software/cumin] - https://gerrit.wikimedia.org/r/345402 (https://phabricator.wikimedia.org/T161730) (owner: Volans)
[09:04:30] (Merged) jenkins-bot: Do not auto-ucfirst when the query is a regex [software/cumin] - https://gerrit.wikimedia.org/r/345402 (https://phabricator.wikimedia.org/T161730) (owner: Volans)
[09:05:02] (PS2) Volans: PuppetDB backend: consistently use InvalidQueryError [software/cumin] - https://gerrit.wikimedia.org/r/346301 (https://phabricator.wikimedia.org/T162151)
[09:05:04] Operations, netops, Patch-For-Review: LibreNMS improvements - https://phabricator.wikimedia.org/T164911#3266556 (ayounsi)
[09:05:35] (CR) Volans: [C: 2] PuppetDB backend: consistently use InvalidQueryError [software/cumin] - https://gerrit.wikimedia.org/r/346301 (https://phabricator.wikimedia.org/T162151) (owner: Volans)
[09:06:07] (Merged) jenkins-bot: PuppetDB backend: consistently use InvalidQueryError [software/cumin] - https://gerrit.wikimedia.org/r/346301 (https://phabricator.wikimedia.org/T162151) (owner: Volans)
[09:06:50] (PS2) Volans: PuppetDB backend: forbid resource's parameters regex [software/cumin] - https://gerrit.wikimedia.org/r/346302 (https://phabricator.wikimedia.org/T162151)
[09:07:55] (CR) Volans: [C: 2] PuppetDB backend: forbid resource's parameters regex [software/cumin] - https://gerrit.wikimedia.org/r/346302 (https://phabricator.wikimedia.org/T162151) (owner: Volans)
[09:08:25] (Merged) jenkins-bot: PuppetDB backend: forbid resource's parameters regex [software/cumin] - https://gerrit.wikimedia.org/r/346302 (https://phabricator.wikimedia.org/T162151) (owner: Volans)
[09:09:02] (PS3) Filippo Giunchedi: swift: introduce storage policies [puppet] - https://gerrit.wikimedia.org/r/353878 (https://phabricator.wikimedia.org/T151648)
[09:09:23] (PS2) Volans: ClusterShell: fix set of list options [software/cumin] - https://gerrit.wikimedia.org/r/352796
(https://phabricator.wikimedia.org/T164824)
[09:10:25] (CR) Volans: [C: 2] ClusterShell: fix set of list options [software/cumin] - https://gerrit.wikimedia.org/r/352796 (https://phabricator.wikimedia.org/T164824) (owner: Volans)
[09:10:47] (CR) Jcrespo: [C: 2] mariadb-auto_install: Remove db1056, add db1055 to reimage list [puppet] - https://gerrit.wikimedia.org/r/353955 (owner: Jcrespo)
[09:10:55] (Merged) jenkins-bot: ClusterShell: fix set of list options [software/cumin] - https://gerrit.wikimedia.org/r/352796 (https://phabricator.wikimedia.org/T164824) (owner: Volans)
[09:13:32] Operations, ops-eqiad, Patch-For-Review: rack and set up analyics1058-1069 - https://phabricator.wikimedia.org/T162216#3266585 (Ottomata) @Cmjohnson any status update on the 1069 replacement?
[09:19:13] (PS2) Volans: ClusterShell: output directly when single host [software/cumin] - https://gerrit.wikimedia.org/r/352799 (https://phabricator.wikimedia.org/T164827)
[09:21:35] (CR) Marostegui: [C: 1] mariadb: Depool db1055 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/353956 (owner: Jcrespo)
[09:26:08] (CR) Volans: "addressed comment" (1 comment) [software/cumin] - https://gerrit.wikimedia.org/r/352799 (https://phabricator.wikimedia.org/T164827) (owner: Volans)
[09:26:30] !log upgrading mw1189 / mw1293 from HHVM 3.18.2+wmf2 to 3.18.2+wmf3
[09:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:23] PROBLEM - HHVM rendering on mw1293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:42:13] RECOVERY - HHVM rendering on mw1293 is OK: HTTP OK: HTTP/1.1 200 OK - 74657 bytes in 0.309 second response time
[09:47:48] (PS3) Volans: ClusterShell: output directly when single host [software/cumin] - https://gerrit.wikimedia.org/r/352799 (https://phabricator.wikimedia.org/T164827)
[09:48:02] (PS4) Filippo Giunchedi: swift: introduce storage policies [puppet] -
https://gerrit.wikimedia.org/r/353878 (https://phabricator.wikimedia.org/T151648)
[09:48:07] Operations, vm-requests, Goal, kubernetes: Create an etcd cluster in codfw for kubernetes usage - https://phabricator.wikimedia.org/T165467#3266706 (akosiaris)
[09:48:23] (CR) Giuseppe Lavagetto: [C: 1] ClusterShell: output directly when single host [software/cumin] - https://gerrit.wikimedia.org/r/352799 (https://phabricator.wikimedia.org/T164827) (owner: Volans)
[09:50:59] (PS4) Giuseppe Lavagetto: profile::cassandra: auto-generate fqdns for seeds [puppet] - https://gerrit.wikimedia.org/r/353049
[09:51:26] (CR) Giuseppe Lavagetto: [C: 2] "PCC says this is practically a noop." [puppet] - https://gerrit.wikimedia.org/r/353049 (owner: Giuseppe Lavagetto)
[09:51:41] (CR) Volans: [C: 2] ClusterShell: output directly when single host [software/cumin] - https://gerrit.wikimedia.org/r/352799 (https://phabricator.wikimedia.org/T164827) (owner: Volans)
[09:52:21] (Merged) jenkins-bot: ClusterShell: output directly when single host [software/cumin] - https://gerrit.wikimedia.org/r/352799 (https://phabricator.wikimedia.org/T164827) (owner: Volans)
[09:53:53] (PS2) Volans: Transports: move BaseWorker helper methods to module functions [software/cumin] - https://gerrit.wikimedia.org/r/352841 (https://phabricator.wikimedia.org/T164838)
[09:55:57] (PS2) Ema: Add unit tests for DNSQueryMonitoringProtocol [debs/pybal] (1.13) - https://gerrit.wikimedia.org/r/343655
[09:56:32] (CR) Volans: [C: 2] Transports: move BaseWorker helper methods to module functions [software/cumin] - https://gerrit.wikimedia.org/r/352841 (https://phabricator.wikimedia.org/T164838) (owner: Volans)
[09:57:13] (Merged) jenkins-bot: Transports: move BaseWorker helper methods to module functions [software/cumin] - https://gerrit.wikimedia.org/r/352841 (https://phabricator.wikimedia.org/T164838) (owner: Volans)
[09:58:24] (CR) Chad: "What is a regex file?" [debs/gerrit] - https://gerrit.wikimedia.org/r/353766 (owner: Paladox)
[10:00:39] (PS1) Alexandros Kosiaris: Introduce kubetcd200{1,2,3}.codfw.wmnet [dns] - https://gerrit.wikimedia.org/r/353962 (https://phabricator.wikimedia.org/T165467)
[10:03:10] ...
[10:04:19] (Draft1) Paladox: Trusty: fix puppet [puppet] - https://gerrit.wikimedia.org/r/353964
[10:04:22] (PS2) Paladox: Trusty: fix puppet [puppet] - https://gerrit.wikimedia.org/r/353964 (https://phabricator.wikimedia.org/T165462)
[10:05:10] (CR) Giuseppe Lavagetto: [C: 1] Introduce kubetcd200{1,2,3}.codfw.wmnet [dns] - https://gerrit.wikimedia.org/r/353962 (https://phabricator.wikimedia.org/T165467) (owner: Alexandros Kosiaris)
[10:09:16] (PS3) Paladox: Trusty: fix puppet [puppet] - https://gerrit.wikimedia.org/r/353964 (https://phabricator.wikimedia.org/T165462)
[10:10:35] (PS3) Giuseppe Lavagetto: restbase: convert production cluster to role/profile [puppet] - https://gerrit.wikimedia.org/r/353050
[10:14:06] !log upgrading mw1185-mw1189 to Linux 4.9 and HHVM 3.18
[10:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:20] Operations, Security: Go from "E" to "A+" on Securityheaders.io - https://phabricator.wikimedia.org/T165455#3266289 (Tgr) >>! In T165455#3266388, @matmarex wrote: > We emit `X-Frame-Options: DENY` on all "sensitive" pages (pretty much everything that displays a form, e.g. action=edit or special pages). W...
[10:16:16] (PS4) Paladox: HHVM: Fix puppet on trusty [puppet] - https://gerrit.wikimedia.org/r/353964 (https://phabricator.wikimedia.org/T165462)
[10:20:38] (PS4) Giuseppe Lavagetto: restbase: convert production cluster to role/profile [puppet] - https://gerrit.wikimedia.org/r/353050
[10:25:51] Operations, Security: Go from "E" to "A+" on Securityheaders.io - https://phabricator.wikimedia.org/T165455#3266893 (Bawolff) > In theory we could have a whitlist and then emit DENY or ALLOW-FROM depending on the origin, but it would have to be implemented in all kinds of things that render/cache wiki pa...
[10:27:32] (PS1) Dereckson: Rename MFCustomLogos to MinervaCustomLogos [mediawiki-config] - https://gerrit.wikimedia.org/r/353965 (https://phabricator.wikimedia.org/T164502)
[10:27:55] !log addshore@terbium mwscriptwikiset extensions/Cognate/maintenance/populateCognatePages.php wiktionary.dblist --batch-size=1000
[10:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:02] (CR) Dereckson: "Sync order doesn't matter, there is an `isset` in mobile.php" [mediawiki-config] - https://gerrit.wikimedia.org/r/353965 (https://phabricator.wikimedia.org/T164502) (owner: Dereckson)
[10:28:51] !log T164407 addshore@terbium mwscriptwikiset extensions/Cognate/maintenance/populateCognatePages.php wiktionary.dblist --batch-size=1000
[10:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:59] T164407: Cognate has been disabled from WMF because it caused an outage on x1 by overtaking 10000 concurrent connections - https://phabricator.wikimedia.org/T164407
[10:30:58] !log installing openjdk-7 security updates on trusty hosts
[10:31:01] (PS2) Dereckson: Rename MFCustomLogos to MinervaCustomLogos [mediawiki-config] - https://gerrit.wikimedia.org/r/353965 (https://phabricator.wikimedia.org/T164502)
[10:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:56] (PS5) Giuseppe Lavagetto: restbase: convert production cluster to role/profile [puppet] - https://gerrit.wikimedia.org/r/353050
[10:32:18] Dereckson: shall we swat that patch today?
[10:32:18] Operations, Analytics, DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3266941 (Ottomata) Update on this. @luca is working on T156933, and in talking, we realized that if we get rid of the second slave (db1047), we will only have one copy of E...
[10:33:48] jdlrobson: if you wish
[10:33:54] (CR) Jdlrobson: Rename MFCustomLogos to MinervaCustomLogos (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/353965 (https://phabricator.wikimedia.org/T164502) (owner: Dereckson)
[10:34:03] i do wish :) minor follow up though ^
[10:34:13] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0]
[10:34:13] jdlrobson: we also have to drop wgMFTrademarkSitename?
[10:34:44] Operations, Analytics, DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3266955 (jcrespo) If redundancy is the main reason, and not load balancing, I would suggest having the redundant server on codfw. But there is now no analytics server on co...
[10:35:02] https://tools.wmflabs.org/versions/ <- ah yes the version without wgMFTrademarkSitename is live everywhere [10:35:13] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] [10:35:41] (03PS6) 10Giuseppe Lavagetto: restbase: convert production cluster to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/353050 [10:36:17] !log upgrading and restarting db2062's mariadb service [10:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:42] (03PS7) 10Giuseppe Lavagetto: restbase: convert production cluster to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/353050 [10:40:58] Dereckson: yup you can drop wgMFTrademarkSitename [10:41:00] it is not used any more [10:42:08] (03PS1) 10Dereckson: Drop wgMFTrademarkSitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353969 [10:43:09] (03PS1) 10Amire80: [DON'T MERGE] Remove special Math extension settings for hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353970 [10:44:42] (03PS1) 10Amire80: Remove UseMathJax from CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353971 (https://phabricator.wikimedia.org/T165475) [10:45:24] (03CR) 10Amire80: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352889 (https://phabricator.wikimedia.org/T164858) (owner: 10Amire80) [10:47:02] (03CR) 10jerkins-bot: [V: 04-1] Add namespace aliases for Hebrew Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352889 (https://phabricator.wikimedia.org/T164858) (owner: 10Amire80) [10:47:55] Dereckson: should i put it on the calendar? [10:48:41] Dereckson: also did you see my follow up suggestion ? 
https://gerrit.wikimedia.org/r/#/c/353965/2 [10:49:22] !log Deploy schema change on testwikidatawiki.wb_terms on s3 codfw master - T165246 [10:49:26] (03CR) 10Jdlrobson: [C: 031] Drop wgMFTrademarkSitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353969 (owner: 10Dereckson) [10:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:32] T165246: Add term_full_entity_id column to wb_terms table on testwikidatawiki - https://phabricator.wikimedia.org/T165246 [10:49:51] jdlrobson: okay, and as there is an isset, same: deploy order won't matter [10:50:35] (03CR) 10Muehlenhoff: "I'm changing this to use Wants: instead of Requires:, as recommended by systemd.unit(5)" [puppet] - 10https://gerrit.wikimedia.org/r/353556 (https://phabricator.wikimedia.org/T163795) (owner: 10Muehlenhoff) [10:51:56] jdlrobson: SpecialMobileWatchlist::getEmptyListHtml still uses ExtensionAssetsPath to get the path for "/MobileFrontend/images/emptywatchlist-page-actions-$dir.png" [10:51:57] (03PS4) 10Amire80: Add namespace aliases for Hebrew Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352889 (https://phabricator.wikimedia.org/T164858) [10:52:44] (but that's not related to use it for an image path) [10:55:00] jdlrobson: yes every current logo is an absolute path in /static/images/mobile/copyright/... [10:56:45] !log upgrading codfw app servers already using HHVM 3.18 to 3.18.2+wmf3 [10:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:26] (03PS2) 10Volans: Transports: add Command class [software/cumin] - 10https://gerrit.wikimedia.org/r/352842 (https://phabricator.wikimedia.org/T164838) [10:58:50] (03PS1) 10Dereckson: Drop {wgExtensionAssetsPath} support in MF/Minerva custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353973 [10:59:47] (03CR) 10Volans: "Addressed comments." 
(032 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/352842 (https://phabricator.wikimedia.org/T164838) (owner: 10Volans) [10:59:57] (03PS3) 10Dereckson: Rename MFCustomLogos to MinervaCustomLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353965 (https://phabricator.wikimedia.org/T164502) [11:00:22] I need around ten minutes or so, 4M rows left, let's not do another round [11:01:42] (03CR) 10Dereckson: Rename MFCustomLogos to MinervaCustomLogos (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353965 (https://phabricator.wikimedia.org/T164502) (owner: 10Dereckson) [11:03:36] I am going to do a rolling upgrade of labsdb1009/10/11- you will see here warnings of the proxies as I restart each server [11:09:23] PROBLEM - HHVM rendering on mw2098 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:10:13] RECOVERY - HHVM rendering on mw2098 is OK: HTTP OK: HTTP/1.1 200 OK - 74657 bytes in 0.165 second response time [11:17:35] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1055 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353956 (owner: 10Jcrespo) [11:19:36] (03Merged) 10jenkins-bot: mariadb: Depool db1055 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353956 (owner: 10Jcrespo) [11:19:45] (03CR) 10jenkins-bot: mariadb: Depool db1055 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353956 (owner: 10Jcrespo) [11:21:49] (03PS1) 10Jdlrobson: Enable print styles for Minerva [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353979 (https://phabricator.wikimedia.org/T163287) [11:24:34] ^I am waiting to start my previous comment, will log when I actually start the restarts [11:24:40] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1055 for reimage (duration: 00m 39s) [11:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:49] !log cleaning up is completely done current number of rows: 9,261,264 T159753 [11:25:57] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:57] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753 [11:27:34] !log ladsgroup@terbium:~$ mwscript extensions/ORES/maintenance/CleanDuplicateScores.php --wiki=enwiki [11:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:55] Got 0 duplicates, cleaning them wooot [11:28:16] !log stopping db1055 before reimage for backup [11:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:41] jynus: marostegui What do think of shrinking the table (we are not busy) it's one tenth of original size now [11:28:52] *if you are not busy [11:29:05] Amir1, we will do that, but it takes time [11:29:09] (03CR) 10XXN: [C: 031] Add namespace aliases for Hebrew Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352889 (https://phabricator.wikimedia.org/T164858) (owner: 10Amire80) [11:29:14] because we may need to do it server by server [11:29:29] add a comment or file a task so we do not forget, thank you [11:29:38] sure [11:31:01] we can check how much size we saved before doing it [11:32:24] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/6434/" [puppet] - 10https://gerrit.wikimedia.org/r/353050 (owner: 10Giuseppe Lavagetto) [11:37:14] (03CR) 10Jdlrobson: [C: 031] Drop {wgExtensionAssetsPath} support in MF/Minerva custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353973 (owner: 10Dereckson) [11:37:35] !log upgrading mw1190-mw1208 to Linux 4.9 and HHVM 3.18 [11:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:44] 06Operations, 10Traffic, 07Beta-Cluster-reproducible: PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#3267222 (10TheDJ) [11:39:13] PROBLEM - nova-compute process on labvirt1001 is CRITICAL: PROCS CRITICAL: 2 processes with 
regex args ^/usr/bin/python /usr/bin/nova-compute [11:40:13] RECOVERY - nova-compute process on labvirt1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [11:41:15] ^ huh [11:42:23] PROBLEM - DPKG on ganeti2006 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:42:23] PROBLEM - DPKG on ganeti2002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:42:43] PROBLEM - DPKG on ganeti2007 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:43:21] that's me ^ [11:43:23] RECOVERY - DPKG on ganeti2006 is OK: All packages OK [11:43:23] RECOVERY - DPKG on ganeti2002 is OK: All packages OK [11:43:43] RECOVERY - DPKG on ganeti2007 is OK: All packages OK [11:43:50] (03CR) 10XXN: "@Tim, I don't think this is good reason to block a site request here, especially as there already is a precedent and other customized site" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/31580 (https://phabricator.wikimedia.org/T43712) (owner: 10Dereckson) [11:45:13] PROBLEM - puppet last run on ganeti2007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[qemu-system-x86] [11:46:03] PROBLEM - DPKG on ganeti2005 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:46:43] PROBLEM - DPKG on ganeti2007 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:46:44] PROBLEM - DPKG on ganeti2004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:46:53] PROBLEM - DPKG on ganeti2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:47:03] PROBLEM - DPKG on ganeti2003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:47:13] PROBLEM - DPKG on ganeti2008 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:47:23] PROBLEM - DPKG on ganeti2006 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:47:23] PROBLEM - DPKG on ganeti2002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:47:32] (03CR) 10Dereckson: "According Gilles, from our Performance team, this change should improve the current situation: "Since 250px is much more common than 220px" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/31580 (https://phabricator.wikimedia.org/T43712) (owner: 10Dereckson) [11:49:53] RECOVERY - DPKG on ganeti2001 is OK: All packages OK [11:50:03] RECOVERY - DPKG on ganeti2003 is OK: All packages OK [11:50:04] RECOVERY - DPKG on ganeti2005 is OK: All packages OK [11:50:13] RECOVERY - DPKG on ganeti2008 is OK: All packages OK [11:50:23] RECOVERY - DPKG on ganeti2006 is OK: All packages OK [11:50:23] RECOVERY - DPKG on ganeti2002 is OK: All packages OK [11:50:43] RECOVERY - DPKG on ganeti2007 is OK: All packages OK [11:50:46] PROBLEM - puppet last run on ganeti2005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[qemu-system-x86] [11:50:46] RECOVERY - DPKG on ganeti2004 is OK: All packages OK [11:52:03] PROBLEM - puppet last run on ganeti2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[qemu-system-x86] [11:52:43] RECOVERY - puppet last run on ganeti2005 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [11:52:59] (03PS2) 10Muehlenhoff: Make HHVM depend on nutcracker service [puppet] - 10https://gerrit.wikimedia.org/r/353556 (https://phabricator.wikimedia.org/T163795) [11:53:03] RECOVERY - puppet last run on ganeti2001 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [11:53:13] RECOVERY - puppet last run on ganeti2007 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [12:01:15] (03PS1) 10Marostegui: db-codfw.php: Depool db2049 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353983 (https://phabricator.wikimedia.org/T162611) [12:03:16] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2049 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353983 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui) [12:04:11] 06Operations, 05Security: Go from "E" to "A+" on Securityheaders.io - https://phabricator.wikimedia.org/T165455#3267300 (10Tgr) >>! In T165455#3266893, @Bawolff wrote: > How would the third-party ensure the view is unauthenticated? Extra url parameter? Sandbox it without the `allow-same-origin` flag? The spec... 
[12:04:20] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2049 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353983 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui) [12:05:20] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2049 - T162611 (duration: 00m 39s) [12:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:28] T162611: Unify revision table on s2 - https://phabricator.wikimedia.org/T162611 [12:06:15] !log Deploy alter table on s2 (revision table) db2049 - https://phabricator.wikimedia.org/T162611 [12:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:54] 06Operations, 10Analytics, 10DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3267305 (10Ottomata) ​+1, that sounds like a good idea to me! [12:22:48] (03CR) 10jenkins-bot: db-codfw.php: Depool db2049 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353983 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui) [12:27:05] (03CR) 10Dereckson: [C: 031] "Not used anymore in Math extension" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353971 (https://phabricator.wikimedia.org/T165475) (owner: 10Amire80) [12:27:53] PROBLEM - ganeti-noded running on ganeti2001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded [12:27:54] PROBLEM - ganeti-mond running on ganeti2008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond [12:28:23] PROBLEM - ganeti-mond running on ganeti2001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond [12:28:43] PROBLEM - ganeti-mond running on ganeti2007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond [12:28:43] PROBLEM - ganeti-noded running on ganeti2008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded [12:28:53] 
RECOVERY - ganeti-noded running on ganeti2001 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded [12:29:43] RECOVERY - ganeti-mond running on ganeti2007 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-mond [12:29:43] RECOVERY - ganeti-noded running on ganeti2008 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded [12:29:53] RECOVERY - ganeti-mond running on ganeti2008 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-mond [12:30:23] RECOVERY - ganeti-mond running on ganeti2001 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-mond [12:34:19] jouncebot: refresh [12:34:21] I refreshed my knowledge about deployments. [12:40:27] (03PS1) 10Mark Bergsma: Add pyenv and pydev config files to .gitignore [debs/pybal] - 10https://gerrit.wikimedia.org/r/353988 [12:54:58] (03PS1) 10Ottomata: Allow setting of zookeeper_version in hiera [puppet] - 10https://gerrit.wikimedia.org/r/353989 [12:55:10] !log Run pt-table-checksum on s7.centralauth - https://phabricator.wikimedia.org/T163190 [12:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:27] (03CR) 10Ottomata: [C: 032] "No-op in https://puppet-compiler.wmflabs.org/6435/" [puppet] - 10https://gerrit.wikimedia.org/r/353989 (owner: 10Ottomata) [12:59:04] jouncebot: refresh [12:59:06] I refreshed my knowledge about deployments. [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170516T1300). Please do the needful. [13:00:04] Jdlrobson and Zppix: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
[13:00:08] o/ [13:00:13] (03PS1) 10Ottomata: Revert "Allow setting of zookeeper_version in hiera" [puppet] - 10https://gerrit.wikimedia.org/r/353991 [13:00:39] (03CR) 10Ottomata: [V: 032 C: 032] Revert "Allow setting of zookeeper_version in hiera" [puppet] - 10https://gerrit.wikimedia.org/r/353991 (owner: 10Ottomata) [13:02:29] \o [13:04:58] 06Operations, 10Analytics, 10Traffic, 15User-Elukey: Add VSL error counters to Varnishkafka stats - https://phabricator.wikimedia.org/T164259#3267476 (10elukey) [13:06:12] (03PS1) 10Ottomata: Allow setting of zookeeper_version in hiera [puppet] - 10https://gerrit.wikimedia.org/r/353992 [13:07:10] 06Operations, 06Performance-Team, 10Thumbor, 05MW-1.30-release-notes (WMF-deploy-2017-05-09_(1.30.0-wmf.1)), 13Patch-For-Review: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#3267481 (10Gilles) I didn't write similar... [13:07:13] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] [13:07:21] jdlrobson: so how are you doing today [13:07:23] !log upgrading mw2110-mw2117 to HHVM 3.18 [13:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:43] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [13:09:13] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] [13:09:23] ottomata: ^ your unmerged changes? [13:10:01] 06Operations, 05Security: Go from "E" to "A+" on Securityheaders.io - https://phabricator.wikimedia.org/T165455#3267484 (10MoritzMuehlenhoff) p:05Triage>03Normal [13:10:12] Zppix: bit stressed. 
too many unbreak nows this week :) [13:10:15] 06Operations, 10ops-eqiad: decommission indium - https://phabricator.wikimedia.org/T165345#3267485 (10MoritzMuehlenhoff) p:05Triage>03Normal [13:10:37] jdlrobson: this week is the worst i bet being that releng is gone so swat is like non existant [13:12:00] hmm, i reverted mine, and it said no changes to merge [13:12:03] about to merge another, will see [13:12:16] ja No changes to merge. [13:12:19] (03CR) 10Ottomata: [C: 032] Allow setting of zookeeper_version in hiera [puppet] - 10https://gerrit.wikimedia.org/r/353992 (owner: 10Ottomata) [13:12:42] im not sure who's doing swat today if anyone. addshore, aude or Dereckson seem like the only people available on this timezone. [13:12:52] or are you doing swats today ottomata ? [13:13:12] jdlrobson: naw i'm at an offsite and should be paying more attention to the real life meeting here :p [13:13:38] jdlrobson: I may be able to do swat, but not for another 15 mins [13:13:56] addshore: that would be awesome if you can. no rush from my side [13:14:30] real life meetings are overrated xD [13:16:01] 06Operations, 10Analytics, 10Traffic, 15User-Elukey: Add VSL error counters to Varnishkafka stats - https://phabricator.wikimedia.org/T164259#3227497 (10Nuria) Let's (as a first step) send these errors to graphite. 
[13:19:06] !log upgrading mw2163-mw2169 to HHVM 3.18 [13:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:43] jdlrobson: ready [13:24:19] (03PS4) 10Addshore: Rename MFCustomLogos to MinervaCustomLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353965 (https://phabricator.wikimedia.org/T164502) (owner: 10Dereckson) [13:24:22] (03CR) 10Addshore: [C: 032] Rename MFCustomLogos to MinervaCustomLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353965 (https://phabricator.wikimedia.org/T164502) (owner: 10Dereckson) [13:24:41] (03CR) 10Addshore: [C: 032] Rename MFCustomLogos to MinervaCustomLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353965 (https://phabricator.wikimedia.org/T164502) (owner: 10Dereckson) [13:24:57] (03PS2) 10Addshore: Drop {wgExtensionAssetsPath} support in MF/Minerva custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353973 (owner: 10Dereckson) [13:25:06] (03PS2) 10Addshore: Drop wgMFTrademarkSitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353969 (owner: 10Dereckson) [13:25:41] (03Merged) 10jenkins-bot: Rename MFCustomLogos to MinervaCustomLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353965 (https://phabricator.wikimedia.org/T164502) (owner: 10Dereckson) [13:25:47] Zppix: https://gerrit.wikimedia.org/r/#/c/353921/ needs fixing [13:25:50] (03CR) 10jenkins-bot: Rename MFCustomLogos to MinervaCustomLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353965 (https://phabricator.wikimedia.org/T164502) (owner: 10Dereckson) [13:27:07] jdlrobson: the first change is on mwdebug1002 please check [13:27:42] addshore: looking [13:28:13] (03PS5) 10Zppix: Raise the account creation limit for www.enwp.org/WP:Meetup/Eugene/WikiAPA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353921 (https://phabricator.wikimedia.org/T165421) [13:28:15] addshore: on it [13:28:36] addshore: fixed... 
andre__ thanks for catching that typo [13:28:39] addshore: is this just the first one or all 3? [13:28:43] just the first [13:28:45] do you want all 3? [13:28:49] yes please for this one [13:28:56] sorry if that wasnt clear from the wiki page [13:28:57] (03CR) 10Addshore: [C: 032] Drop {wgExtensionAssetsPath} support in MF/Minerva custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353973 (owner: 10Dereckson) [13:28:59] (03CR) 10Addshore: [C: 032] Drop wgMFTrademarkSitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353969 (owner: 10Dereckson) [13:29:17] (03PS6) 10Addshore: Raise the account creation limit for www.enwp.org/WP:Meetup/Eugene/WikiAPA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353921 (https://phabricator.wikimedia.org/T165421) (owner: 10Zppix) [13:29:57] (03Merged) 10jenkins-bot: Drop {wgExtensionAssetsPath} support in MF/Minerva custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353973 (owner: 10Dereckson) [13:30:09] (03CR) 10jenkins-bot: Drop {wgExtensionAssetsPath} support in MF/Minerva custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353973 (owner: 10Dereckson) [13:30:36] (03Merged) 10jenkins-bot: Drop wgMFTrademarkSitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353969 (owner: 10Dereckson) [13:31:08] jdlrobson: you should have all 3 there now [13:31:15] thanks addshore on it [13:31:55] addshore: good to sync! [13:32:11] (03CR) 10jenkins-bot: Drop wgMFTrademarkSitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353969 (owner: 10Dereckson) [13:32:19] I'm guessing a sync-dir is probably best? or will one file first work better? 
[13:32:26] It's hard to tell with all 3 patches ;) [13:35:29] synd-dir should be fine [13:36:00] !log addshore@tin Synchronized wmf-config/: SWAT: [[gerrit:353965|#1]] T164502, [[gerrit:353973|#2]], [[gerrit:353969|#3]] (duration: 00m 41s) [13:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:08] T164502: MFCustomLogos is now MinervaCustomLogos - https://phabricator.wikimedia.org/T164502 [13:36:11] {{done}} please double check [13:36:36] (03CR) 10Addshore: [C: 032] Raise the account creation limit for www.enwp.org/WP:Meetup/Eugene/WikiAPA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353921 (https://phabricator.wikimedia.org/T165421) (owner: 10Zppix) [13:36:47] Zppix: yours is next [13:36:54] addshore: untestable [13:36:56] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack and set up analyics1058-1069 - https://phabricator.wikimedia.org/T162216#3267574 (10Cmjohnson) @ottomata I am not sure where we are with status of the replacement server. @robh may have a better idea [13:37:27] addshore: wait did you swat https://gerrit.wikimedia.org/r/#/c/353294/ or is that later? 
[13:37:36] (03Merged) 10jenkins-bot: Raise the account creation limit for www.enwp.org/WP:Meetup/Eugene/WikiAPA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353921 (https://phabricator.wikimedia.org/T165421) (owner: 10Zppix) [13:37:46] jdlrobson: I'll do that after this throttle change [13:37:51] ok great phew [13:37:57] panicked a little there as i hadnt tested :) [13:39:24] addshore: Zppix: hey yes it's *a little bit* testable [13:39:37] at least put it on mwdebug1002, open en.wikipedia.org and checks there isn't any syntax error [13:39:38] Dereckson: Zppix indeed, (I just checked it) :) [13:39:51] !log addshore@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:353921|Raise the account creation limit for www.enwp.org/WP:Meetup/Eugene/WikiAPA]] T165421 (duration: 00m 39s) [13:39:52] !log rebooting restbase2006 for update to Linux 4.9 and to pick up openjdk security updates [13:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:00] T165421: Throttle exception for English Wikipedia Edit-a-Thon on 2017-05-26 - https://phabricator.wikimedia.org/T165421 [13:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:12] aah jdlrobson https://gerrit.wikimedia.org/r/#/c/353294/1 is on master [13:40:17] i guess you want me to CP it to the branch? [13:40:36] addshore: whoops - https://gerrit.wikimedia.org/r/#/q/21763c4e8c1ea1f692887026e36934ead42bfdc8 [13:40:38] should be that one [13:40:46] already cherry picked it [13:40:49] but used wrong gerrit [13:40:55] ack :) [13:40:58] https://gerrit.wikimedia.org/r/353978 to be exact [13:41:18] jdlrobson: could you fix the link on wiki please? [13:44:12] Thanks addshore [13:44:38] 06Operations: Upgrade qemu on ganeti clusters to 2.7 - https://phabricator.wikimedia.org/T150532#3267593 (10akosiaris) 05Open>03stalled I 've tested this and indeed it breaks running VMs as expected. I 've patched up ganeti and awaiting review. 
PR for this is up at https://github.com/ganeti/ganeti/pull/43. I... [13:45:25] jdlrobson: its on mwdebug1002 [13:46:37] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce kubetcd200{1,2,3}.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/353962 (https://phabricator.wikimedia.org/T165467) (owner: 10Alexandros Kosiaris) [13:46:48] addshore: great. You can sync now! [13:48:06] syncing [13:48:37] !log addshore@tin Synchronized php-1.30.0-wmf.1/extensions/QuickSurveys/extension.json: SWAT: [[gerrit:353978|Explicitly add mediawiki.cookie dependency]] (duration: 00m 39s) [13:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:46] !log SWAT done [13:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:45] (03CR) 10jenkins-bot: Raise the account creation limit for www.enwp.org/WP:Meetup/Eugene/WikiAPA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353921 (https://phabricator.wikimedia.org/T165421) (owner: 10Zppix) [13:53:15] 06Operations, 10Traffic, 10netops: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006 - https://phabricator.wikimedia.org/T150256#2779434 (10BBlack) Re: ethernet port validation / config, the last table we had in the old ticket is here: T104458#1788478 . The idea was to try our best to ensure that a given vl... [13:53:24] !log upgrading mw2170-mw2179 to Linux 4.9 and HHVM 3.18 [13:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:16] thanks addshore for all the help today :) [13:56:21] no worries :) [13:56:25] will you be at the hackathon? 
[14:06:07] !log rebooting restbase2007 for update to Linux 4.9 and to pick up openjdk security updates [14:06:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:39] !log rolling restart labsdb1009,10,11 for mariadb upgrade (and kernel upgrade) [14:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:05] (03CR) 10BBlack: [C: 031] Interface: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/353332 (https://phabricator.wikimedia.org/T163196) (owner: 10Volans) [14:11:17] now is when several proxies will complain [14:13:33] PROBLEM - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [14:13:41] that is 1 [14:13:47] it will happen 2 more times [14:18:16] (03PS8) 10Giuseppe Lavagetto: restbase: convert production cluster to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/353050 [14:22:56] (03CR) 10Giuseppe Lavagetto: [C: 032] restbase: convert production cluster to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/353050 (owner: 10Giuseppe Lavagetto) [14:23:36] <_joe_> ottomata: I count 3 unmerged patches of yours [14:23:59] <_joe_> should I merge them? [14:24:20] <_joe_> they've been there for more than one hour... [14:25:11] 1009 up again, restarting 10 [14:25:17] _joe_: I pinged him earlier, see backscroll at 15:12: "hmm, i reverted mine, and it said no changes to merge" [14:25:33] RECOVERY - haproxy failover on dbproxy1010 is OK: OK check_failover servers up 2 down 0 [14:26:06] <_joe_> moritzm: I guess he ran puppet-merge on a backend [14:26:12] <_joe_> so let me merge those anyways [14:26:43] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. 
[14:27:12] !log upgrading mw2180-mw2189 to Linux 4.9 and HHVM 3.18 [14:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:49] <_joe_> mobrovac: I'm running puppet on rb2001 [14:28:21] <_joe_> ah I keep doing the same mistake :P [14:28:23] PROBLEM - puppet last run on restbase2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:28:32] heh [14:28:43] second host down [14:30:33] PROBLEM - haproxy failover on dbproxy1011 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [14:30:33] RECOVERY - HP RAID on ms-be1037 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [14:30:33] <_joe_> mobrovac: I forget to add the credentials to the new role, specifically [14:30:33] PROBLEM - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [14:30:42] as expected [14:31:07] <_joe_> mobrovac: uhm I added a few nodes to restbase::seeds [14:31:13] <_joe_> in codfw [14:31:23] RECOVERY - puppet last run on restbase2001 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [14:31:28] <_joe_> as it only included restbase2001 and restbase2002 [14:31:53] <_joe_> should I revert that for now? maybe it's a better idea [14:33:34] (03PS1) 10Giuseppe Lavagetto: role::restbase::production: restore original list of seeds [puppet] - 10https://gerrit.wikimedia.org/r/353996 [14:33:45] <_joe_> mobrovac, urandom ^^ this would be the revert [14:34:20] !log kartik@tin Started deploy [cxserver/deploy@6118dda]: Update cxserver to 740641f [14:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:33] <_joe_> anyways, trying rb1007 now [14:35:02] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack and set up analyics1058-1069 - https://phabricator.wikimedia.org/T162216#3267734 (10RobH) The server that dropped? 
We are still working on resolving the existing one, and getting it gone, and then ordering a new one. The new order will result in its own #... [14:36:41] !log kartik@tin Finished deploy [cxserver/deploy@6118dda]: Update cxserver to 740641f (duration: 02m 21s) [14:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:15] _joe_: yes revert for now and let's move more cautiously [14:37:27] <_joe_> cool [14:37:37] <_joe_> so the only change will be the eventlogging service uri [14:37:54] <_joe_> eqiad is going smoothly btw [14:38:02] (03CR) 10Giuseppe Lavagetto: [C: 032] role::restbase::production: restore original list of seeds [puppet] - 10https://gerrit.wikimedia.org/r/353996 (owner: 10Giuseppe Lavagetto) [14:38:23] _joe_: ??? i merged them... [14:38:31] looking [14:38:33] <_joe_> ottomata: which host? [14:38:49] <_joe_> ottomata: I mean on which host did you ran puppet-merge from? [14:38:54] ahhh puppetmaster1001.eqiad.wmnet [14:38:59] doh! [14:39:02] 1002, ight? [14:39:07] <_joe_> nope, it's 1001 [14:39:09] hm [14:39:15] <_joe_> and trust me, it was *not* merged there [14:39:23] ok [14:39:25] <_joe_> either 1001 or 2001 are ok [14:39:28] (03PS2) 10Faidon Liambotis: interface: remove unused definition ::offload [puppet] - 10https://gerrit.wikimedia.org/r/353332 (https://phabricator.wikimedia.org/T163196) (owner: 10Volans) [14:39:29] [@puppetmaster1001:/home/otto] $ sudo puppet-merge [14:39:29] Fetching new commits from https://gerrit.wikimedia.org/r/p/operations/puppet [14:39:30] No changes to merge. [14:39:48] <_joe_> ottomata: now or before? [14:39:50] now [14:39:52] oh before? [14:39:53] you merged? 
[14:39:55] <_joe_> yes [14:39:58] ah, hm [14:39:59] <_joe_> I needed to [14:40:03] yeah its fine [14:40:05] mine was a no-op [14:40:05] (03PS9) 10Faidon Liambotis: interface/lvs: add an $interface parameter, remove hardcoded eth0 [puppet] - 10https://gerrit.wikimedia.org/r/350770 (https://phabricator.wikimedia.org/T163196) [14:40:07] (03PS8) 10Faidon Liambotis: cache: use interface_primary instead of eth0 [puppet] - 10https://gerrit.wikimedia.org/r/350771 (https://phabricator.wikimedia.org/T163196) [14:40:08] <_joe_> so, let me finish my work for a sec [14:40:28] k, thanks, sorry about that... [14:41:33] RECOVERY - haproxy failover on dbproxy1011 is OK: OK check_failover servers up 2 down 0 [14:41:33] RECOVERY - haproxy failover on dbproxy1010 is OK: OK check_failover servers up 2 down 0 [14:42:29] <_joe_> mobrovac: I'm running puppet everywhere [14:42:49] <_joe_> mobrovac: then you can just restart restbase to pick up the new eventlogging uri [14:43:52] 06Operations, 10ops-codfw, 10DBA: db2058: Predictive RAID failure - https://phabricator.wikimedia.org/T165498#3267751 (10Marostegui) [14:44:05] _joe_: so you didn't switch rb to the new role/profile? [14:44:06] * mobrovac is confused now [14:44:57] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack and set up analyics1058-1069 - https://phabricator.wikimedia.org/T162216#3267755 (10Ottomata) K, thanks. 
[14:45:32] <_joe_> mobrovac: I did, I'm running puppet around [14:46:01] <_joe_> mobrovac: as we discussed, the rb config didn't have one variable defined, the eventlogging service uri [14:46:22] <_joe_> so the change in practice is [14:46:22] <_joe_> -eventlogging_service_uri: "http://eventbus.svc.eqiad.wmnet:8085/v1/events" [14:46:24] <_joe_> +eventlogging_service_uri: "http://eventbus.discovery.wmnet:8085/v1/events" [14:46:28] kk [14:46:44] _joe_: lmk once puppet is done, i'll restart and monitor [14:47:51] (03PS1) 10Marostegui: db-codfw.php: Depool db2041, repool db2056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354000 (https://phabricator.wikimedia.org/T162611) [14:48:27] <_joe_> mobrovac: done! [14:48:32] k [14:50:01] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2041, repool db2056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354000 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui) [14:50:24] !log mobrovac@tin Started restart [restbase/deploy@d98af6f]: Apply new puppet role/profile paradigm [14:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:51] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2041, repool db2056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354000 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui) [14:52:03] (03CR) 10jenkins-bot: db-codfw.php: Depool db2041, repool db2056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354000 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui) [14:53:23] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2056, depool db2041 - T162611 (duration: 00m 41s) [14:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:31] T162611: Unify revision table on s2 - https://phabricator.wikimedia.org/T162611 [14:53:33] !log Deploy alter table on s2 (revision table) db2041 - https://phabricator.wikimedia.org/T162611 [14:53:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:49] (03CR) 10Volans: [C: 031] "Noop as expected: https://puppet-compiler.wmflabs.org/6436/" [puppet] - 10https://gerrit.wikimedia.org/r/353332 (https://phabricator.wikimedia.org/T163196) (owner: 10Volans) [14:53:54] (03PS3) 10Volans: interface: remove unused definition ::offload [puppet] - 10https://gerrit.wikimedia.org/r/353332 (https://phabricator.wikimedia.org/T163196) [14:55:34] (03PS1) 10Alexandros Kosiaris: Add kubernetes etcd records for codfw [dns] - 10https://gerrit.wikimedia.org/r/354002 (https://phabricator.wikimedia.org/T165467) [14:55:43] (03CR) 10Volans: [C: 032] interface: remove unused definition ::offload [puppet] - 10https://gerrit.wikimedia.org/r/353332 (https://phabricator.wikimedia.org/T163196) (owner: 10Volans) [14:56:06] (03PS1) 10Marostegui: db-codfw.php: Remove old comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354003 [14:58:06] (03CR) 10Marostegui: [C: 032] db-codfw.php: Remove old comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354003 (owner: 10Marostegui) [14:59:00] (03CR) 10Jcrespo: [C: 031] db-codfw.php: Remove old comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354003 (owner: 10Marostegui) [15:00:08] (03Merged) 10jenkins-bot: db-codfw.php: Remove old comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354003 (owner: 10Marostegui) [15:00:20] (03CR) 10jenkins-bot: db-codfw.php: Remove old comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354003 (owner: 10Marostegui) [15:00:33] 06Operations, 10netops: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#3267810 (10faidon) 14.1X53-D43 seems to have been released on May 11th. This particular PR isn't mentioned on the release notes, so the fix may or may not be i... 
[15:00:47] and finally, latest restart of the batch: labsdb1011 [15:00:55] (03CR) 10Alexandros Kosiaris: [C: 032] Add kubernetes etcd records for codfw [dns] - 10https://gerrit.wikimedia.org/r/354002 (https://phabricator.wikimedia.org/T165467) (owner: 10Alexandros Kosiaris) [15:01:11] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Remove old comment (duration: 00m 39s) [15:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:10] (03PS10) 10Volans: interface/lvs: add an $interface parameter, remove hardcoded eth0 [puppet] - 10https://gerrit.wikimedia.org/r/350770 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [15:03:33] PROBLEM - haproxy failover on dbproxy1011 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [15:06:45] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Decomission mw2098 - https://phabricator.wikimedia.org/T164959#3267820 (10Papaul) a:05Papaul>03RobH ge-3/0/18 [15:08:20] (03PS1) 10Alexandros Kosiaris: Introduce kubetcd200{1,2,3}.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/354005 [15:08:23] (03PS1) 10Alexandros Kosiaris: Use kubetcd200{1,2,3} in the kubernetes codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/354006 [15:09:56] _joe_: looking good [15:10:21] _joe_: no, wait [15:10:33] <_joe_> ha [15:10:36] <_joe_> what's wrong? 
[15:11:02] no sorry, false alarm [15:11:04] tutto bene [15:12:35] codfw queues now being filled by RB yay _joe_ [15:12:42] kafka queues that is [15:13:10] (03PS5) 10Paladox: HHVM: Fix puppet on trusty [puppet] - 10https://gerrit.wikimedia.org/r/353964 (https://phabricator.wikimedia.org/T165462) [15:13:18] (03PS6) 10Paladox: HHVM: Fix puppet on trusty [puppet] - 10https://gerrit.wikimedia.org/r/353964 (https://phabricator.wikimedia.org/T165462) [15:14:47] (03CR) 10Giuseppe Lavagetto: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/354005 (owner: 10Alexandros Kosiaris) [15:17:48] (03PS2) 10Alexandros Kosiaris: Introduce kubetcd200{1,2,3}.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/354005 [15:18:01] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Introduce kubetcd200{1,2,3}.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/354005 (owner: 10Alexandros Kosiaris) [15:20:50] seems translatewiki is down ? [15:25:11] (03PS11) 10Volans: interface/lvs: add an $interface parameter, remove hardcoded eth0 [puppet] - 10https://gerrit.wikimedia.org/r/350770 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [15:25:47] thedj, not a wikimedia site, maybe try #mediawiki-i18n ? 
[15:27:57] (03CR) 10Volans: [C: 032] interface/lvs: add an $interface parameter, remove hardcoded eth0 [puppet] - 10https://gerrit.wikimedia.org/r/350770 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [15:28:47] and last proxy up again [15:29:33] RECOVERY - haproxy failover on dbproxy1011 is OK: OK check_failover servers up 2 down 0 [15:32:40] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: decom mira - https://phabricator.wikimedia.org/T164588#3267961 (10Papaul) [15:34:07] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: decom mira - https://phabricator.wikimedia.org/T164588#3238927 (10Papaul) a:05Papaul>03RobH Disk wipe complete, system unracked, racktables update [15:35:23] 06Operations, 06Parsing-Team, 07HHVM, 06Release-Engineering-Team (Watching / External), 07Wikimedia-Incident: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#3267979 (10greg) [15:35:34] 06Operations, 10DBA, 06Release-Engineering-Team (Watching / External): Audit all existing code to ensure that any extension currently or previously adding blobs to ExternalStore has been registering a reference in the text table (and fix up if wrong) - https://phabricator.wikimedia.org/T106388#3267987 (10greg) [15:38:08] 06Operations, 06Release-Engineering-Team (Backlog), 07Security-General: setup mwreleases1001.eqiad.wmnet - https://phabricator.wikimedia.org/T164030#3267996 (10greg) [15:38:17] 06Operations, 10Gerrit, 06Release-Engineering-Team (Backlog): Make sure replying to emails in gerrit 2.14 works - https://phabricator.wikimedia.org/T158915#3267998 (10greg) [15:38:26] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 06Release-Engineering-Team (Backlog): Gerrit: Schedule downtime to migrate db to utf8mb4 - https://phabricator.wikimedia.org/T155764#3268002 (10greg) [15:38:30] (sorry for any phab spam, we're redoing our workboards and... 
yeah, sorry) [15:38:33] 06Operations, 10Phabricator, 06Release-Engineering-Team (Backlog): reinstall iridium (phabricator) as phab1001 with jessie - https://phabricator.wikimedia.org/T152129#3268005 (10greg) [15:38:46] 06Operations, 06Security-Team, 06Release-Engineering-Team (Backlog), 15User-greg: Determine a core set or a checklist of permissions for deployment purpose - https://phabricator.wikimedia.org/T140270#3268011 (10greg) [15:39:07] 06Operations, 06Release-Engineering-Team (Backlog): Rewrite http://download.wikimedia.org/mediawiki/ -> https://releases.wikimedia.org/mediawiki in less than 3 redirects - https://phabricator.wikimedia.org/T119679#3268023 (10greg) [15:40:07] (03CR) 10Alexandros Kosiaris: [C: 032] uwsgi::app: add reload capability in systemd [puppet] - 10https://gerrit.wikimedia.org/r/352551 (owner: 10Giuseppe Lavagetto) [15:40:12] (03PS2) 10Alexandros Kosiaris: uwsgi::app: add reload capability in systemd [puppet] - 10https://gerrit.wikimedia.org/r/352551 (owner: 10Giuseppe Lavagetto) [15:40:18] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] uwsgi::app: add reload capability in systemd [puppet] - 10https://gerrit.wikimedia.org/r/352551 (owner: 10Giuseppe Lavagetto) [15:40:51] (03CR) 10Alexandros Kosiaris: [C: 032] service::uwsgi: fix logrotate rules [puppet] - 10https://gerrit.wikimedia.org/r/352552 (owner: 10Giuseppe Lavagetto) [15:40:58] (03PS6) 10Milimetric: Fixing "Book_talk" namespace alias for ro.wikipedia: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352728 (owner: 10XXN) [15:41:06] (03CR) 10Milimetric: [C: 032] Fixing "Book_talk" namespace alias for ro.wikipedia: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352728 (owner: 10XXN) [15:42:08] (03Merged) 10jenkins-bot: Fixing "Book_talk" namespace alias for ro.wikipedia: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352728 (owner: 10XXN) [15:42:20] (03CR) 10jenkins-bot: Fixing "Book_talk" namespace alias for ro.wikipedia: [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/352728 (owner: 10XXN) [15:43:03] 06Operations, 10MediaWiki-Database: Compress data at external storage - https://phabricator.wikimedia.org/T106386#3268064 (10jcrespo) [15:43:07] 06Operations, 10DBA, 06Release-Engineering-Team (Watching / External): Audit all existing code to ensure that any extension currently or previously adding blobs to ExternalStore has been registering a reference in the text table (and fix up if wrong) - https://phabricator.wikimedia.org/T106388#3268061 (10jcre... [15:44:33] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2051351 [15:48:07] 06Operations, 10Education-Program-Dashboard, 03Programs-and-Events-Dashboard-Sprint 2, 07Spike: Spike: What do we have to package to run the Programs and Events dashboard on production? - https://phabricator.wikimedia.org/T126295#3268097 (10greg) a:05dduvall>03None [15:48:17] !log restarting and upgrading db1095 [15:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:51] (03PS5) 10Alexandros Kosiaris: graphite::alerts: add alerting on session loss [puppet] - 10https://gerrit.wikimedia.org/r/350555 (owner: 10Giuseppe Lavagetto) [15:48:53] (03PS2) 10Alexandros Kosiaris: service::uwsgi: fix logrotate rules [puppet] - 10https://gerrit.wikimedia.org/r/352552 (owner: 10Giuseppe Lavagetto) [15:50:06] (03CR) 10Alexandros Kosiaris: "I 've updated this by mistake, sorry" [puppet] - 10https://gerrit.wikimedia.org/r/350555 (owner: 10Giuseppe Lavagetto) [15:50:17] (03CR) 10Alexandros Kosiaris: [C: 032] service::uwsgi: fix logrotate rules [puppet] - 10https://gerrit.wikimedia.org/r/352552 (owner: 10Giuseppe Lavagetto) [15:50:28] (03PS3) 10Alexandros Kosiaris: service::uwsgi: fix logrotate rules [puppet] - 10https://gerrit.wikimedia.org/r/352552 (owner: 10Giuseppe Lavagetto) [15:50:34] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] service::uwsgi: fix logrotate rules [puppet] - 
10https://gerrit.wikimedia.org/r/352552 (owner: 10Giuseppe Lavagetto) [15:54:52] 06Operations, 06Release-Engineering-Team (Kanban), 07Security-General: setup mwreleases1001.eqiad.wmnet - https://phabricator.wikimedia.org/T164030#3268114 (10greg) [15:54:54] I got lots of errors on test wikipedia and mwdebug1002 [15:55:04] 50 minutes ago [15:55:21] 06Operations, 10Gerrit, 07Beta-Cluster-reproducible, 13Patch-For-Review, and 2 others: gerrit jgit gc caused mediawiki/core repo problems - https://phabricator.wikimedia.org/T151676#3268125 (10greg) [15:56:32] 06Operations, 10Continuous-Integration-Config, 13Patch-For-Review, 06Release-Engineering-Team (Kanban): Create a basic RSpec unit test for operations/puppet - https://phabricator.wikimedia.org/T78342#3268156 (10greg) [16:00:05] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170516T1600). Please do the needful. [16:00:38] 06Operations, 10Continuous-Integration-Infrastructure, 13Patch-For-Review, 06Release-Engineering-Team (Backlog): Secondary production Jenkins for CI - https://phabricator.wikimedia.org/T150771#3268164 (10greg) [16:00:43] 06Operations, 10Monitoring, 06Release-Engineering-Team (Backlog), 07Tracking, 07Wikimedia-Incident: Tracking: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942#3268166 (10greg) [16:01:27] (done) [16:07:04] (03CR) 10Volans: [C: 031] "LGTM, compiler diffs are here:" [puppet] - 10https://gerrit.wikimedia.org/r/350771 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [16:10:42] 06Operations, 10DBA, 06Release-Engineering-Team (Watching / External): Audit all existing code to ensure that any extension currently or previously adding blobs to ExternalStore has been registering a reference in the text table (and fix up if wrong) - https://phabricator.wikimedia.org/T106388#3268180 (10Jdfo... 
[16:15:59] (03CR) 10BBlack: [C: 031] cache: use interface_primary instead of eth0 [puppet] - 10https://gerrit.wikimedia.org/r/350771 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [16:16:56] (03PS9) 10Volans: cache: use interface_primary instead of eth0 [puppet] - 10https://gerrit.wikimedia.org/r/350771 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [16:19:01] 06Operations, 10DBA, 06Release-Engineering-Team (Watching / External): Audit all existing code to ensure that any extension currently or previously adding blobs to ExternalStore has been registering a reference in the text table (and fix up if wrong) - https://phabricator.wikimedia.org/T106388#3268191 (10jcre... [16:19:05] (03CR) 10Volans: [C: 032] cache: use interface_primary instead of eth0 [puppet] - 10https://gerrit.wikimedia.org/r/350771 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [16:19:49] 06Operations, 10MediaWiki-Database: Compress data at external storage - https://phabricator.wikimedia.org/T106386#3268196 (10jcrespo) [16:20:03] (03PS6) 10Paladox: Test: DO NOT MERGE [debs/gerrit] - 10https://gerrit.wikimedia.org/r/350440 [16:22:49] !log robh@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw2098.codfw.wmnet [16:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:02] oh, logs it for me, neato. 
[16:25:47] (03PS2) 10RobH: decommission mw2098 [puppet] - 10https://gerrit.wikimedia.org/r/353918 [16:27:03] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Decomission mw2098 - https://phabricator.wikimedia.org/T164959#3268281 (10RobH) [16:27:06] (03CR) 10RobH: [C: 032] decommission mw2098 [puppet] - 10https://gerrit.wikimedia.org/r/353918 (owner: 10RobH) [16:29:04] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Decomission mw2098 - https://phabricator.wikimedia.org/T164959#3268288 (10RobH) [16:30:53] (03CR) 10Volans: [C: 031] "LGTM, compiler results here: https://puppet-compiler.wmflabs.org/6438/" [puppet] - 10https://gerrit.wikimedia.org/r/350773 (owner: 10Faidon Liambotis) [16:31:21] (03PS2) 10Volans: Add a new interface::alias definition [puppet] - 10https://gerrit.wikimedia.org/r/350773 (owner: 10Faidon Liambotis) [16:32:16] (03PS2) 10RobH: decommission mw2098 (production dns) [dns] - 10https://gerrit.wikimedia.org/r/353920 [16:32:33] (03CR) 10RobH: [C: 032] decommission mw2098 (production dns) [dns] - 10https://gerrit.wikimedia.org/r/353920 (owner: 10RobH) [16:34:33] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Decomission mw2098 - https://phabricator.wikimedia.org/T164959#3268358 (10RobH) [16:35:12] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Decomission mw2098 - https://phabricator.wikimedia.org/T164959#3252410 (10RobH) a:05RobH>03Papaul Ok, this is now ready to have the disks wiped, and then pulled from the rack for decommission. Please complete the onsite steps remainin... 
[16:40:13] !log upgrading mw2190-mw2199 to Linux 4.9 and HHVM 3.18 [16:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:24] (03PS2) 10Dzahn: ci::master: move 'include standard' to role [puppet] - 10https://gerrit.wikimedia.org/r/353357 [16:41:01] (03PS3) 10Faidon Liambotis: Add a new interface::alias definition [puppet] - 10https://gerrit.wikimedia.org/r/350773 [16:41:03] (03PS2) 10Faidon Liambotis: labs::dnsrecursor: switch to interface::alias [puppet] - 10https://gerrit.wikimedia.org/r/350774 [16:41:07] (03PS2) 10Faidon Liambotis: gerrit: switch to interface::alias [puppet] - 10https://gerrit.wikimedia.org/r/350776 [16:41:09] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Purge Varnish cache when a banner is saved - https://phabricator.wikimedia.org/T154954#3268379 (10Pcoombe) @DStrine @AndyRussG Can we prioritise working on this again once banner sequencing is done? It would... [16:41:14] (03PS3) 10Faidon Liambotis: phabricator: switch to interface::alias [puppet] - 10https://gerrit.wikimedia.org/r/350777 [16:41:16] (03PS2) 10Faidon Liambotis: lists: switch to interface::alias [puppet] - 10https://gerrit.wikimedia.org/r/350778 [16:41:18] (03PS2) 10Faidon Liambotis: cassandra: switch to interface::alias [puppet] - 10https://gerrit.wikimedia.org/r/350775 [16:42:12] (03CR) 10Dzahn: [C: 032] ci::master: move 'include standard' to role [puppet] - 10https://gerrit.wikimedia.org/r/353357 (owner: 10Dzahn) [16:44:32] (03PS5) 10Dzahn: yubiauth: convert to profile/role structure [puppet] - 10https://gerrit.wikimedia.org/r/345085 [16:49:37] 06Operations, 05Prometheus-metrics-monitoring, 15User-fgiunchedi: Upgrade mysqld_exporter to 0.10.0 - https://phabricator.wikimedia.org/T161296#3268413 (10jcrespo) So, the metrics, as far as I can see, works without problem without changing the configuration. However, hosts with multi-source now return repli... 
[16:53:20] (03PS4) 10Volans: Add a new interface::alias definition [puppet] - 10https://gerrit.wikimedia.org/r/350773 (owner: 10Faidon Liambotis) [16:58:13] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] [16:59:16] (03CR) 10Volans: [C: 032] Add a new interface::alias definition [puppet] - 10https://gerrit.wikimedia.org/r/350773 (owner: 10Faidon Liambotis) [17:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170516T1700). Please do the needful. [17:00:13] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] [17:00:14] no parsoid deploy today [17:05:30] (03CR) 10Volans: [C: 031] "LGTM, compiler results here: https://puppet-compiler.wmflabs.org/6439/" [puppet] - 10https://gerrit.wikimedia.org/r/350774 (owner: 10Faidon Liambotis) [17:05:33] !log upgrading mw2017/mw2099 to Linux 4.9 and HHVM 3.18 [17:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:54] (03PS3) 10Volans: labs::dnsrecursor: switch to interface::alias [puppet] - 10https://gerrit.wikimedia.org/r/350774 (owner: 10Faidon Liambotis) [17:09:01] (03CR) 10Volans: [C: 032] labs::dnsrecursor: switch to interface::alias [puppet] - 10https://gerrit.wikimedia.org/r/350774 (owner: 10Faidon Liambotis) [17:11:03] 06Operations, 10ops-eqiad, 15User-fgiunchedi: Debug HP raid cache disabled errors on ms-be1019/20/21 - https://phabricator.wikimedia.org/T163777#3268453 (10Cmjohnson) Logs sent to HP, they're most likely going to want to do a f/w upgrade first. [17:13:48] 06Operations, 05Prometheus-metrics-monitoring, 15User-fgiunchedi: Upgrade mysqld_exporter to 0.10.0 - https://phabricator.wikimedia.org/T161296#3268474 (10jcrespo) Probably good enough for now? 
{F8100314} [17:16:05] (03PS3) 10Volans: gerrit: switch to interface::alias [puppet] - 10https://gerrit.wikimedia.org/r/350776 (owner: 10Faidon Liambotis) [17:17:02] 06Operations: Switch to predictable network interface names? - https://phabricator.wikimedia.org/T158429#3268552 (10faidon) After all the patches for T163196, as well as the interface::alias work (Gerrit topic:T163196, topic:interface-alias etc.), the only hardcoded "eth0"s remaining across the tree are P5452 (c... [17:17:12] 06Operations: Switch to predictable network interface names? - https://phabricator.wikimedia.org/T158429#3268554 (10faidon) [17:17:14] 06Operations: Installer assumes eth0 is the used interface - https://phabricator.wikimedia.org/T164444#3268555 (10faidon) [17:17:48] (03PS1) 10Framawiki: Replace one feed at metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354024 (https://phabricator.wikimedia.org/T165285) [17:19:06] 06Operations, 10ops-eqiad, 06Labs: Degraded RAID on labstore1003 - https://phabricator.wikimedia.org/T165220#3260322 (10Cmjohnson) The disk has been swapped and is the in the process of rebuilding Enclosure Device ID: 38 Slot Number: 0 Enclosure position: 2 Device Id: 40 WWN: 5000C50025FD9E58 Sequence Num... [17:21:51] 06Operations, 10netops: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#3268647 (10faidon) After another round with ATAC, this is the latest: > PR 1238906 is the original PR for this issue and it was raised by me. This is fixed sta... [17:22:58] 06Operations, 10ops-eqiad, 06Labs: Degraded RAID on labstore1003 - https://phabricator.wikimedia.org/T165220#3260322 (10madhuvishy) @Cmjohnson Thanks for taking care of this! 
[17:23:24] !log swapping optics asw-c-eqiad xe-8/0/38 T165008 [17:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:32] T165008: Interface errors on asw-c-eqiad:xe-8/0/38 - https://phabricator.wikimedia.org/T165008 [17:23:56] (03PS6) 10Dzahn: yubiauth: convert to profile/role structure [puppet] - 10https://gerrit.wikimedia.org/r/345085 [17:25:37] (03CR) 10Dzahn: [C: 032] "the only difference on auth* servers is the name of the role (motd) and resources of classes that are instantiated instead of included." [puppet] - 10https://gerrit.wikimedia.org/r/345085 (owner: 10Dzahn) [17:25:51] (03PS7) 10Dzahn: yubiauth: convert to profile/role structure [puppet] - 10https://gerrit.wikimedia.org/r/345085 [17:29:21] (03CR) 10Dzahn: "double-confirmed after merge: nothing happened on auth1001/2001 except the motd change" [puppet] - 10https://gerrit.wikimedia.org/r/345085 (owner: 10Dzahn) [17:33:03] 06Operations, 10ops-eqiad: ripe-atlas-eqiad is down - https://phabricator.wikimedia.org/T163243#3268741 (10RobH) a:05RobH>03Cmjohnson Reply, Chris is CC'd, and needs to reply back with the serial: > Dear Rob, > > Sorry to hear you suffered an outage, I guess it's the motherboard > that will have to be... [17:44:43] 06Operations, 10ops-eqiad: ripe-atlas-eqiad is down - https://phabricator.wikimedia.org/T163243#3268752 (10Cmjohnson) @robh the S/N is 275391 P/N 15650170 There is also another number w/barcode jic it's 349320391 [17:46:06] (03CR) 10Volans: [C: 04-1] "Compiler results shows that there is already a problem with the production code." [puppet] - 10https://gerrit.wikimedia.org/r/350776 (owner: 10Faidon Liambotis) [17:54:57] 06Operations, 10ops-eqiad: ripe-atlas-eqiad is down - https://phabricator.wikimedia.org/T163243#3268793 (10RobH) I've sent the details on to Wim for support replacement, along with the ship to address details. 
[18:02:15] 06Operations, 10hardware-requests: CODFW: (4) hardware access request for kubernetes - https://phabricator.wikimedia.org/T161700#3268804 (10faidon) [18:05:44] 06Operations, 10ops-eqiad, 10netops: Interface errors on asw-c-eqiad:xe-8/0/38 - https://phabricator.wikimedia.org/T165008#3268810 (10Cmjohnson) @ayounsi It appears that the optics swap on asw-c did not help...should I replace on cr2? cmjohnson@asw-c-eqiad> show interfaces xe-8/0/38 extensive | match error... [18:09:53] ACKNOWLEDGEMENT - puppet last run on kubernetes1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 14 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cni] alexandros kosiaris known, cni still being packaged [18:09:53] ACKNOWLEDGEMENT - puppet last run on kubernetes1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 24 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cni] alexandros kosiaris known, cni still being packaged [18:09:53] ACKNOWLEDGEMENT - puppet last run on kubernetes1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 22 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cni] alexandros kosiaris known, cni still being packaged [18:09:53] ACKNOWLEDGEMENT - puppet last run on kubernetes1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[cni] alexandros kosiaris known, cni still being packaged [18:13:54] 06Operations, 10ops-eqiad: rack and setup ms1307-1348 - https://phabricator.wikimedia.org/T165519#3268837 (10Cmjohnson) [18:14:53] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [18:15:08] (03PS1) 10Alexandros Kosiaris: Add codfw to kubernetes ganglia_cluster [puppet] - 10https://gerrit.wikimedia.org/r/354031 [18:15:29] (03Abandoned) 10Alexandros Kosiaris: Change the default LVS BGP behavior per service [debs/pybal] - 10https://gerrit.wikimedia.org/r/353525 (owner: 10Alexandros Kosiaris) [18:16:25] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add codfw to kubernetes ganglia_cluster [puppet] - 10https://gerrit.wikimedia.org/r/354031 (owner: 10Alexandros Kosiaris) [18:18:35] 06Operations, 10ops-eqiad: rack and setup 24 parsoid servers - https://phabricator.wikimedia.org/T165520#3268863 (10Cmjohnson) [18:21:43] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1055 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354032 [18:25:46] (03PS1) 10RobH: decom mira [dns] - 10https://gerrit.wikimedia.org/r/354033 [18:26:07] !log cp1074: run-no-puppet varnish-backend-restart (has high mailbox lag, causing small 503 spikes) [18:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:45] (03CR) 10RobH: [C: 032] decom mira [dns] - 10https://gerrit.wikimedia.org/r/354033 (owner: 10RobH) [18:27:40] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: decom mira - https://phabricator.wikimedia.org/T164588#3268900 (10RobH) a:05RobH>03None [18:27:45] (03PS1) 10Dzahn: remove PTR of old lists.wm.org service IP [dns] - 10https://gerrit.wikimedia.org/r/354034 [18:29:25] (03PS1) 10Alexandros Kosiaris: Add kubernetes_codfw cluster as well [puppet] - 10https://gerrit.wikimedia.org/r/354035 [18:29:51] (03CR) 10Alexandros Kosiaris: [C: 032] Add 
kubernetes_codfw cluster as well [puppet] - 10https://gerrit.wikimedia.org/r/354035 (owner: 10Alexandros Kosiaris) [18:29:54] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add kubernetes_codfw cluster as well [puppet] - 10https://gerrit.wikimedia.org/r/354035 (owner: 10Alexandros Kosiaris) [18:30:29] (03CR) 10Faidon Liambotis: [C: 04-1] "There seems to be a duplicate IPv6 as well" [dns] - 10https://gerrit.wikimedia.org/r/354034 (owner: 10Dzahn) [18:31:31] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1055 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354032 (owner: 10Jcrespo) [18:32:23] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:32:43] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1055 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354032 (owner: 10Jcrespo) [18:32:54] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1055 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354032 (owner: 10Jcrespo) [18:33:11] (03PS1) 10Alexandros Kosiaris: Amend description for kubernetes clusters in hiera [puppet] - 10https://gerrit.wikimedia.org/r/354036 [18:33:32] (03CR) 10Alexandros Kosiaris: [C: 032] Amend description for kubernetes clusters in hiera [puppet] - 10https://gerrit.wikimedia.org/r/354036 (owner: 10Alexandros Kosiaris) [18:33:35] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Amend description for kubernetes clusters in hiera [puppet] - 10https://gerrit.wikimedia.org/r/354036 (owner: 10Alexandros Kosiaris) [18:33:53] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:34:13] PROBLEM - puppet last run on tegmen is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:34:33] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0 [18:37:04] milimetric, I see a patch merged but not deployed [18:38:10] RECOVERY - puppet last run on tegmen is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [18:41:44] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1055 after reimage (duration: 00m 39s) [18:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:48] (03PS2) 10Dzahn: remove PTR of old lists.wm.org service IP [dns] - 10https://gerrit.wikimedia.org/r/354034 [18:44:04] (03CR) 10Jcrespo: "This seems to be merged but not deployed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352728 (owner: 10XXN) [18:45:09] (03PS3) 10Dzahn: use correct mapped IPv6, remove PTR of old lists.wm.org service IP [dns] - 10https://gerrit.wikimedia.org/r/354034 [18:46:00] (03CR) 10Dzahn: "amended to also fix the duplicate IPv6. that means _changing_ it though to the mapped address. both are bound to eth0. i believe the non-m" [dns] - 10https://gerrit.wikimedia.org/r/354034 (owner: 10Dzahn) [18:46:10] RECOVERY - Check correctness of the icinga configuration on tegmen is OK: Icinga configuration is correct [18:46:46] (03CR) 10Jcrespo: "I have already deployed this, but I am going to reset HEAD~2 tin to leave things how I found them: https://wikitech.wikimedia.org/wiki/Ser" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354032 (owner: 10Jcrespo) [18:47:10] PROBLEM - puppet last run on tegmen is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:47:18] (03PS4) 10Dzahn: remove PTR of old lists.wm.org service IP [dns] - 10https://gerrit.wikimedia.org/r/354034 [18:48:16] 06Operations, 10Ops-Access-Requests: add Arzhel Younsi to datacenter access lists - https://phabricator.wikimedia.org/T165054#3268926 (10RobH) [18:48:30] 06Operations, 10Ops-Access-Requests: add Arzhel Younsi to datacenter access lists - https://phabricator.wikimedia.org/T165054#3255501 (10RobH) 05Open>03Resolved confirmed addition with all dc vendors [18:49:30] !log rolled back to HEAD~2 on tin to leave things the way I found them [18:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:10] RECOVERY - puppet last run on tegmen is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [18:54:30] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [18:57:07] (03CR) 10Dzahn: lists: switch to interface::alias (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/350778 (owner: 10Faidon Liambotis) [19:05:37] (03PS1) 10Dzahn: lists: fix service IPs in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/354038 [19:10:24] (03PS2) 10Dzahn: lists: fix service IPs in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/354038 [19:14:06] mutante: these are being overriden really by the fermium hiera [19:14:10] (which should probably go entirely?) 
[19:15:32] mutante: also, the server_ips variable should go in favor of ipaddress/ipaddress6 facts [19:15:43] (03CR) 10Dzahn: "what this will actually change is /etc/exim4/exim4.conf: "list_smtp" section: interface = <; 208.80.154.75 ; 2620:0:861:3::2 ; 208.80.1" [puppet] - 10https://gerrit.wikimedia.org/r/354038 (owner: 10Dzahn) [19:16:35] mutante: if you want to change lists' service IPs, you really need to do that in steps [19:17:02] mutante: first add the PTR, then add the IP to the interface/change the exim config [19:22:39] paravoid: do you think it's right that i change the service IP to the mapped address in general? if v4 is 208.80.154.75 then v6 should be 2620:0:861:3:208:80:154:75 too (not 2620:0:861:3::2) ? [19:22:54] sure, sounds fine to me [19:23:26] and yea @ fermium hiera per host. amending to remove that [19:24:18] i think we added that ::2 IPv6 service IP before we even had the mapped function [19:24:29] I doubt it, the mapped function is really old [19:29:05] (03PS3) 10Dzahn: lists: fix service/server IPs in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/354038 [19:44:17] 06Operations, 10ops-eqiad, 06Analytics-Kanban: analytics1030 stuck in console while booting - https://phabricator.wikimedia.org/T162046#3269102 (10Cmjohnson) 05Open>03Resolved The system board has been replaced and the idrac failure has been corrected but now we have a raid bbu issue...creating a new tic... [19:45:04] 06Operations, 10ops-eqiad: analytics1030 failed bbu - https://phabricator.wikimedia.org/T165529#3269104 (10Cmjohnson) [19:48:43] (03Draft1) 10Paladox: redis: Fix redis for stretch [puppet] - 10https://gerrit.wikimedia.org/r/354041 [19:48:57] (03PS2) 10Paladox: redis: Fix redis for stretch [puppet] - 10https://gerrit.wikimedia.org/r/354041 [19:51:18] (03CR) 10Paladox: [C: 031] "Bump." 
[puppet] - 10https://gerrit.wikimedia.org/r/353600 (owner: 10Dzahn) [20:10:47] !log mobrovac@tin Started restart [restbase/deploy@d98af6f] (dev-cluster): Apply the revision range delition algorithm - T164865 [20:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:57] T164865: Prototype and test range delete-based current revision storage - https://phabricator.wikimedia.org/T164865 [20:13:30] PROBLEM - Restbase root url on restbase-dev1001 is CRITICAL: connect to address 10.64.0.35 and port 7231: Connection refused [20:14:50] known, ignore ^ [20:16:10] PROBLEM - Check systemd state on restbase-dev1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:19:46] (03PS3) 10Paladox: redis: add support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/354041 [20:22:10] RECOVERY - Check systemd state on restbase-dev1001 is OK: OK - running: The system is fully operational [20:22:30] RECOVERY - Restbase root url on restbase-dev1001 is OK: HTTP OK: HTTP/1.1 200 - 15540 bytes in 0.085 second response time [20:22:37] !log mobrovac@tin Started restart [restbase/deploy@d98af6f] (dev-cluster): Apply the revision range deletion algorithm, take 2 - T164865 [20:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:44] T164865: Prototype and test range delete-based current revision storage - https://phabricator.wikimedia.org/T164865 [20:22:59] (03PS4) 10Paladox: redis: add support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/354041 [20:24:47] (03Draft1) 10Paladox: redis: Remove support for precise [puppet] - 10https://gerrit.wikimedia.org/r/354045 [20:24:49] (03PS2) 10Paladox: redis: Remove support for precise [puppet] - 10https://gerrit.wikimedia.org/r/354045 [20:27:37] (03CR) 10Dzahn: [C: 031] redis: Remove support for precise [puppet] - 10https://gerrit.wikimedia.org/r/354045 (owner: 10Paladox) [20:28:58] (03CR) 10Dzahn: [C: 031] "the file name does 
not show up in puppet manifests because the code is "source => "puppet:///modules/redis/redis-${::lsbdistcodename}.conf" [puppet] - 10https://gerrit.wikimedia.org/r/354045 (owner: 10Paladox) [20:30:34] (03CR) 10Faidon Liambotis: [C: 04-1] "Copying the file around is the not DRY and thus not the right solution (unless we expect them to differ significantly). If the config is t" [puppet] - 10https://gerrit.wikimedia.org/r/354041 (owner: 10Paladox) [20:30:50] (03CR) 10Dzahn: "this happens because the code is "source => "puppet:///modules/redis/redis-${::lsbdistcodename}.conf" for distro-specific settings. the qu" [puppet] - 10https://gerrit.wikimedia.org/r/354041 (owner: 10Paladox) [20:33:25] (03CR) 10Faidon Liambotis: [C: 04-1] "I'd split the removal of old cruft from the renumbering of the new service IPv6. The latter will probably need to be done in steps, affect" [dns] - 10https://gerrit.wikimedia.org/r/354034 (owner: 10Dzahn) [20:36:16] Hi. [20:37:17] Re: https://gerrit.wikimedia.org/r/#/c/352728/6 [20:37:23] (03CR) 10Faidon Liambotis: [C: 04-1] "Same here -- multiple commits needed:" [puppet] - 10https://gerrit.wikimedia.org/r/354038 (owner: 10Dzahn) [20:37:37] mutante: ^^^ I'm here if you want to discuss all these [20:37:38] Indeed, there is nothing at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:45] So we've two solutions: deploy it, revert it [20:37:58] milimetric: ping? [20:38:14] Dereckson: yeah [20:38:27] paravoid: ok, i started with a new commit that has "step 1" in the name.. then step 2.. [20:38:31] milimetric is probably at the analytics offsite [20:38:33] let's ping releng [20:38:45] paravoid: jynus did that [20:39:07] greg-g, RainbowSprinkles etc.: see https://gerrit.wikimedia.org/r/#/c/352728/6 [20:39:32] greg-g, RainbowSprinkles: merged but not deployed [20:39:41] Well, I'm going to revert it. [20:39:42] if noone responds in the next 10 minutes or so, let's just revert [20:39:54] This is a task to act on a *2014* request. 
[20:40:07] whoops :) [20:40:23] We don't know how the *2017* community position. [20:40:43] (03PS1) 10Dzahn: fix lists/fermium: step 1, add PTR for new v6 service IP [dns] - 10https://gerrit.wikimedia.org/r/354046 [20:41:08] (03PS5) 10Paladox: redis: add support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/354041 [20:42:22] (03CR) 10Faidon Liambotis: [C: 04-1] "How would that even work? Presumably the settings in redis-jessie.conf cannot be used with trusty's version?" [puppet] - 10https://gerrit.wikimedia.org/r/354041 (owner: 10Paladox) [20:42:36] Dereckson: yeah, revert please [20:43:33] (03CR) 10Paladox: "> How would that even work? Presumably the settings in" [puppet] - 10https://gerrit.wikimedia.org/r/354041 (owner: 10Paladox) [20:43:38] (03PS1) 10Dereckson: Revert "Fixing "Book_talk" namespace alias for ro.wikipedia:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354047 [20:43:57] (03CR) 10Dereckson: [C: 032] Revert "Fixing "Book_talk" namespace alias for ro.wikipedia:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354047 (owner: 10Dereckson) [20:44:54] (03Merged) 10jenkins-bot: Revert "Fixing "Book_talk" namespace alias for ro.wikipedia:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354047 (owner: 10Dereckson) [20:45:28] (03PS6) 10Paladox: redis: add support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/354041 [20:45:47] (03CR) 10jenkins-bot: Revert "Fixing "Book_talk" namespace alias for ro.wikipedia:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354047 (owner: 10Dereckson) [20:45:56] (03CR) 10Paladox: "I couldn't think of a good name for the new redis config file so i went with redis-os." [puppet] - 10https://gerrit.wikimedia.org/r/354041 (owner: 10Paladox) [20:46:22] (03CR) 10jerkins-bot: [V: 04-1] redis: add support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/354041 (owner: 10Paladox) [20:46:43] dude seriously, can you think things through before you submit patches like that? 
[20:46:49] that's paladox ^^ [20:47:17] paravoid yes, i tested the prevous change, the patchset you first -1 [20:47:20] I appreciate your willingness to help, and I'm happy to guide you to things that are new or foreign to you [20:47:38] and it's ok if it takes more time than if I would had done it [20:47:46] ok [20:47:47] (03PS1) 10Dzahn: fix lists/fermium: step 2, remove wrong PTR for service IP [dns] - 10https://gerrit.wikimedia.org/r/354048 [20:47:49] but it's not OK if you're not spending more than 5 seconds to think a patch before you submit it [20:48:07] marostegui: ping? [20:48:19] your latest patch puts /etc/redis/redis.conf under a >= jessie conditional, what do you think it will happen if this manifest is applied on a trusty system? [20:49:05] (03CR) 10Faidon Liambotis: [C: 04-1] "Missing the removal of the IPv6 again :)" [dns] - 10https://gerrit.wikimedia.org/r/354048 (owner: 10Dzahn) [20:49:07] (03PS2) 10Dzahn: fix lists/fermium: step 2, remove wrong PTR for service IP [dns] - 10https://gerrit.wikimedia.org/r/354048 [20:49:27] paravoid: Dereckson thanks, we just got back from dinner [20:49:38] Oh, i thought doing that would mean that it would require jessie or higher. Was i wrong and instead it does the oposite to what i want? [20:49:38] (re that merged but not deployed patch that was reverted) [20:50:01] paladox: it means that the File resource will only be applied on jessie or higher [20:50:04] Now, the tricky follow-up question, what we do with db1055 (jynus merged a change to repool it, before 'rolled back to HEAD~2 on tin', but this repool change is still in the queue) [20:50:28] paladox: which means that the file won't be created at all on trusty systems [20:50:30] oh, i see [20:50:50] Dereckson: I'd let him take care of that [20:51:44] Should i create the redis.conf file with the debian check for specific configs that we want against debian but not on trusty? Ie a redis.erb template? 
[20:52:09] paravoid: eh, but i split it into the small steps on purpose now, so i did not want to add the removal of the v6 address to that.. [20:52:24] will amend [20:52:29] mutante: there are two IPv6 addresses with PTR right now [20:52:39] mutante: one that is actually used, and one that is not [20:52:41] (03PS1) 10Dereckson: Revert "Revert "mariadb: Depool db1055 for reimage"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354049 [20:53:09] (03CR) 10Dereckson: [C: 032] "To restore the operations/mediawiki-config prod/repo parity (at b40cd334d92)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354049 (owner: 10Dereckson) [20:53:31] mutante: 2620:0:861:1::2 and 2620:0:861:3::2 [20:53:58] mutante: step one is to remove 208.80.154.4 and 2620:0:861:1::2 from DNS [20:54:08] (03Merged) 10jenkins-bot: Revert "Revert "mariadb: Depool db1055 for reimage"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354049 (owner: 10Dereckson) [20:54:14] mutante: step two is to renumber 2620:0:861:3::2 to the mapped IP, and is entirely orthogonal to all this [20:54:37] marostegui: so operations/mediawiki-config master is now clean and in working state [20:54:46] mutante: 208.80.154.4 and 2620:0:861:1::2 are not being used at all, they are just cruft on ops/dns [20:55:44] (03CR) 10jenkins-bot: Revert "Revert "mariadb: Depool db1055 for reimage"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354049 (owner: 10Dereckson) [20:56:34] (03CR) 10Dereckson: "@Millimetric: For changes in operations/mediawiki-config repository, please follow the https://wikitech.wikimedia.org/wiki/SWAT_deploys pr" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352728 (owner: 10XXN) [20:57:01] paravoid: ok. yea, i see those are not used. it's just about the right amount of things per patch. 
*nod* to all that [20:58:01] mutante: yeah, I'd say that the lines are broadly about 1) removing cruft 2) reworking hiera variables 3) renumbering lists' IPv6 [20:58:15] three different purposes really :) [20:59:40] PROBLEM - puppet last run on labvirt1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:00:39] (03PS3) 10Dzahn: fix lists/fermium: step 1, remove wrong PTRs for service IP [dns] - 10https://gerrit.wikimedia.org/r/354048 [21:00:50] (03PS2) 10Dzahn: fix lists/fermium: step 2, add PTR for new v6 service IP [dns] - 10https://gerrit.wikimedia.org/r/354046 [21:02:49] (03CR) 10Faidon Liambotis: [C: 032] fix lists/fermium: step 1, remove wrong PTRs for service IP [dns] - 10https://gerrit.wikimedia.org/r/354048 (owner: 10Dzahn) [21:04:04] ok:) submits that [21:05:21] (03PS3) 10Dzahn: fix lists/fermium: step 2, add PTR for new v6 service IP [dns] - 10https://gerrit.wikimedia.org/r/354046 [21:07:01] mutante: minor typo in the commit message, s/74/75/ in the first mention [21:07:42] (03PS4) 10Dzahn: fix lists/fermium: step 2, add PTR for new v6 service IP [dns] - 10https://gerrit.wikimedia.org/r/354046 [21:08:02] fixed, and "step 1"->"step 2" [21:08:05] (03CR) 10Faidon Liambotis: [C: 032] fix lists/fermium: step 2, add PTR for new v6 service IP [dns] - 10https://gerrit.wikimedia.org/r/354046 (owner: 10Dzahn) [21:11:11] now about Hiera, i am thinking "lists_ip" should be in role/common, but "server_ip" should be in hosts/ [21:11:26] see my comment in one of the patchsets [21:11:27] 06Operations, 10ops-eqiad, 10Analytics-Cluster, 06Analytics-Kanban, 15User-Elukey: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#3269263 (10Cmjohnson) [21:11:29] 06Operations, 10ops-eqiad: Analytics1040 system board repair needed - https://phabricator.wikimedia.org/T164942#3269261 (10Cmjohnson) 05Open>03Resolved The new motherboard has been added. System is back online. Resolving this task. 
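[note] A rough sketch of the "mapped" address convention discussed above (v4 208.80.154.75 -> v6 2620:0:861:3:208:80:154:75). It assumes the convention is simply: keep the /64 prefix and reuse the four decimal IPv4 octets, digit for digit, as the last four groups. `mapped_ipv6` is a hypothetical helper written for illustration; it is not the mapped function that exists in the puppet repo.

```python
import ipaddress


def mapped_ipv6(prefix: str, ipv4: str) -> str:
    """Build a 'mapped' IPv6 address from a /64 prefix and an IPv4 address.

    Assumption: each decimal IPv4 octet is written verbatim as one IPv6
    group (so 208.80.154.75 becomes ...:208:80:154:75).
    """
    network = ipaddress.IPv6Network(prefix)
    if network.prefixlen != 64:
        raise ValueError("this sketch only handles /64 prefixes")
    # Validate the IPv4 address, then split it into its decimal octets.
    octets = str(ipaddress.IPv4Address(ipv4)).split(".")
    # The first four groups of the exploded network address are the /64 prefix.
    prefix_groups = network.network_address.exploded.split(":")[:4]
    return str(ipaddress.IPv6Address(":".join(prefix_groups + octets)))


print(mapped_ipv6("2620:0:861:3::/64", "208.80.154.75"))
```

With the prefix and service IP from the discussion, this gives the address that the "step 2" PTR change is about, as opposed to the older 2620:0:861:3::2.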
[21:11:36] server_ip should go in favor of ipaddress/ipaddress6 facts [21:11:42] oh, right, i saw that [21:14:20] RECOVERY - Host analytics1040 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms [21:14:30] RECOVERY - Host analytics1030 is UP: PING OK - Packet loss = 0%, RTA = 36.64 ms [21:17:10] PROBLEM - Hadoop NodeManager on analytics1030 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [21:17:12] PROBLEM - Check whether ferm is active by checking the default input chain on analytics1030 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [21:25:36] (03PS1) 10Dzahn: lists: use $::ipaddress facts instead of server IP in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/354051 [21:27:17] paravoid: ^ i am using the _eth0 variant specifically, an array of the v4 and the v6 IP. also, should it still stay a parameter of the profile or nah [21:27:26] why the _eth0 variant? 
[21:27:31] don't do that, no [21:27:48] i just saw that i have both options when i ran "facter | grep ipaddres" and it seemed more specific [21:27:51] ok [21:28:01] we literally just went through the herculean task of getting rid of all the _eth0 ones :) [21:28:06] finished just today :) [21:28:22] heh, ok:) [21:28:23] also $facts['ipaddress'] is the new notation [21:28:40] RECOVERY - puppet last run on labvirt1006 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [21:28:43] oh, TIL [21:28:44] ( https://phabricator.wikimedia.org/T163196 are the gory details if you're interested ) [21:29:37] 06Operations: Puppet facts around the primary network interface and IPv4/IPv6 address - https://phabricator.wikimedia.org/T163196#3269313 (10faidon) [21:30:10] 06Operations: Puppet facts around the primary network interface and IPv4/IPv6 address - https://phabricator.wikimedia.org/T163196#3189426 (10faidon) I think all of the changes described here have been merged ­-- right @volans? If you agree, want to do the honors of resolving this? [21:30:51] mutante: also, probably the right change would be to get rid of the variable and move it to the .erb [21:31:48] paravoid: i still have to merge $facts['ipaddress'] and $facts['ipaddress6'] into one array ? 
[21:32:00] interface = <; <%= @outbound_ips.join(" ; ") %> [21:32:09] looks like it, but the better option would be to modify the template instead [21:32:16] and get rid of $outbound_ips altogether [21:32:17] 06Operations: Puppet facts around the primary network interface and IPv4/IPv6 address - https://phabricator.wikimedia.org/T163196#3269318 (10Volans) 05Open>03Resolved a:03faidon [21:32:40] ok, looking at the template part, at first tried to avoid changing that [21:33:13] why [21:33:43] interface = <; <%= @ipaddress %> ; <%= @ipaddress6 %> [21:34:19] fair enough :) [21:34:26] not sure if @facts['ipaddress'] works tbh [21:37:00] meh, works in puppet 4 but not in 3.8 [21:39:18] (03PS2) 10Dzahn: lists: use $::ipaddress facts instead of server IP in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/354051 [21:41:08] (03PS3) 10Dzahn: lists: use $::ipaddress facts instead of server IP in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/354051 [21:41:10] .. tabs [21:42:55] (03PS4) 10Dzahn: lists: use $::ipaddress facts instead of server IP in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/354051 [21:43:16] (03CR) 10Faidon Liambotis: [C: 031] "LGTM, but I'd check with the compiler and/or apply manually after merging to check if the diff makes sense." [puppet] - 10https://gerrit.wikimedia.org/r/354051 (owner: 10Dzahn) [21:45:21] mutante: also, use 'facter -p', not facter [21:45:44] mutante: the former includes more facts (the ones we have in puppet), and in the case of ipaddress/ipaddress6 especially, we override the default facts [21:46:02] so on some systems the value of 'facter ipaddress' and 'facter -p ipaddress' will differ [21:46:35] oh, interesting! 
thanks [21:46:52] compiler says: [21:46:53] - interface = <; 208.80.154.74 ; 2620:0:861:3:208:80:154:74 ; 208.80.154.61 ; 2620:0:861:1:208:80:154:61 [21:46:56] + interface = 208.80.154.74 ; 2620:0:861:3::2 [21:47:01] there is that literal <; [21:47:10] oh yes, that's needed [21:47:23] i first had it and then it looked like my own typo, heh [21:47:24] but why were the lists IPs there? [21:47:48] oh [21:47:50] lol [21:48:01] hiera used IPs from both yamls [21:48:02] jesus [21:48:15] :o merging them... [21:48:19] no [21:48:23] the other ones are the old ones [21:50:10] also, it uses 2620:0:861:3::2 because meh, we need to use interface::alias [21:50:13] rgrr [21:50:19] (03PS5) 10Dzahn: lists: use $::ipaddress facts instead of server IP in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/354051 [21:51:19] let me [21:51:20] sorry [21:51:23] easier done than explained :P [21:51:53] please do, ok [21:52:04] i just added the "<;" [21:52:13] and now http://puppet-compiler.wmflabs.org/6444/ [21:52:51] (03PS3) 10Faidon Liambotis: lists: switch to interface::alias [puppet] - 10https://gerrit.wikimedia.org/r/350778 [21:54:04] (03PS4) 10Faidon Liambotis: lists: switch to interface::alias [puppet] - 10https://gerrit.wikimedia.org/r/350778 [21:54:44] (03CR) 10Faidon Liambotis: [C: 032] lists: switch to interface::alias [puppet] - 10https://gerrit.wikimedia.org/r/350778 (owner: 10Faidon Liambotis) [21:55:08] (03CR) 10Dzahn: "there was this diff: http://puppet-compiler.wmflabs.org/6444/fermium.wikimedia.org/ now https://gerrit.wikimedia.org/r/350778 came firs" [puppet] - 10https://gerrit.wikimedia.org/r/354051 (owner: 10Dzahn) [21:56:16] ugh [21:57:34] (03PS1) 10Faidon Liambotis: lists: brown paper bag fix for 128c6df [puppet] - 10https://gerrit.wikimedia.org/r/354053 [21:58:30] PROBLEM - puppet last run on fermium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:58:38] yeah yeah [21:59:34] (03CR) 10Faidon Liambotis: [C: 032] lists: brown paper bag fix for 128c6df [puppet] - 10https://gerrit.wikimedia.org/r/354053 (owner: 10Faidon Liambotis) [21:59:51] ACKNOWLEDGEMENT - puppet last run on fermium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues daniel_zahn in progress [22:00:17] aaah @ requires.. [22:01:30] RECOVERY - puppet last run on fermium is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [22:02:42] (03PS6) 10Faidon Liambotis: lists: use ipaddress facts instead of server IP in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/354051 (owner: 10Dzahn) [22:03:09] 06Operations, 06MediaWiki-Platform-Team, 06Performance-Team, 07Availability (Multiple-active-datacenters), and 6 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3269464 (10Volans) @aaron @tstarling @Joe: here is a minimal list of fail... [22:03:35] see, ipaddress6 changed now :) [22:03:44] putting it again against the compiler [22:04:10] :) [22:04:28] oh that doesn't work [22:04:35] we need to update the facts on the compiler, *sigh* [22:05:32] that's easy [22:05:41] since the latest fixes I've made ;) [22:05:50] are the docs @ https://wikitech.wikimedia.org/wiki/Nova_Resource:Puppet3-diffs accurate? [22:06:14] yes, but if you use strict check [22:06:22] ? 
[22:06:23] you need to replace the puppet masters [22:06:26] with 1001 and 2001 [22:06:37] in your ssh config [22:07:02] (03PS1) 10Dzahn: fix lists/fermium: step 3, add new service IP, additionally [puppet] - 10https://gerrit.wikimedia.org/r/354055 [22:07:07] puppet.eqiad.wmnet is a CNAME to puppetmaster1001.eqiad.wmnet [22:07:16] and same for codfw -> 2001 [22:07:34] I don't have access to the labs project [22:07:45] so I can only connect to compiler02.puppet3-diffs.eqiad.wmflabs as root [22:07:57] can you really quickly do that for me? :) [22:08:16] sure :) (and I didn't know about this different access) [22:08:30] PROBLEM - NTP on analytics1030 is CRITICAL: NTP CRITICAL: Offset unknown [22:08:48] also I'll amend https://gerrit.wikimedia.org/r/#/c/354038/3 first [22:08:55] and merge that first, so that we have a better baseline [22:09:03] has puppet run already on the hosts that you want the facts updated for? [22:09:10] yes [22:09:15] (fermium) [22:09:26] volans: labs/private / modules/secret/secrets/ssh/root-authorized-keys [22:09:46] that gives you the root access to all labs instances [22:10:26] mutante: shouldn't https://gerrit.wikimedia.org/r/#/c/354034/ be aborted? 
[22:11:22] paravoid: yes, it should, i just had it open reading your comment on the steps [22:11:50] paravoid: facts updated [22:11:54] (03Abandoned) 10Dzahn: remove PTR of old lists.wm.org service IP [dns] - 10https://gerrit.wikimedia.org/r/354034 (owner: 10Dzahn) [22:12:46] the script should work also connecting as root AFAICT [22:13:00] it uses sudo all around but should not harm [22:14:27] (03PS4) 10Faidon Liambotis: lists: fix service/server IPs in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/354038 (owner: 10Dzahn) [22:14:29] (03PS7) 10Faidon Liambotis: lists: use ipaddress facts instead of server IP in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/354051 (owner: 10Dzahn) [22:14:30] ok, better [22:14:31] what a mess [22:16:58] (03CR) 10Faidon Liambotis: [C: 032] lists: fix service/server IPs in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/354038 (owner: 10Dzahn) [22:18:35] :) was doing the same. cool [22:19:15] facts don't seem updated [22:19:19] but I'll proceed anyway [22:19:22] fingers crossed :P [22:19:32] (03CR) 10Faidon Liambotis: [C: 032] lists: use ipaddress facts instead of server IP in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/354051 (owner: 10Dzahn) [22:24:26] I've slightly improved the docs at https://wikitech.wikimedia.org/wiki/Nova_Resource:Puppet3-diffs [22:24:32] volans: it didn't work fwiw [22:24:36] it used the old facts [22:24:59] mmmh strange, let me check, it was fermium right? [22:25:02] yes [22:26:11] (03PS2) 10Dzahn: fix lists/fermium: step 3, add new service IP, additionally [puppet] - 10https://gerrit.wikimedia.org/r/354055 [22:26:22] fermium.eqiad or fermium.wikimedia? 
(there are facts for both :( ) [22:26:42] fermium.wikimedia.org, there is no fermium.eqiad.wmnet [22:26:51] (03PS1) 10Faidon Liambotis: lists: split mailman::lists_ip variable into v4/v6 [puppet] - 10https://gerrit.wikimedia.org/r/354058 [22:27:22] (03PS3) 10Dzahn: fix lists/fermium: step 3, add new service IP, additionally [puppet] - 10https://gerrit.wikimedia.org/r/354055 [22:27:55] mutante: no, you can't add multiple IPv6, you'll just have to change the address [22:28:14] mutante: but do it on top of https://gerrit.wikimedia.org/r/#/c/354058/ [22:29:18] ok, was trying to follow the steps from comments on the abandoned change [22:29:19] there are facts for fermium.eqiad.wmnet too on the compiler, dated Apr 28th, I'm deleting them [22:29:45] mutante: see the steps from my comments on https://gerrit.wikimedia.org/r/354038 [22:29:50] volans: Apr 28th of which year? [22:30:00] not this year for sure [22:30:16] yeah was this year [22:30:24] wtf? that can't be [22:30:36] but I've checked they are not on the puppetmasters [22:30:46] (03CR) 10Dzahn: [C: 04-1] "can't add 2 IPv6's at once, has to change in one step" [puppet] - 10https://gerrit.wikimedia.org/r/354055 (owner: 10Dzahn) [22:30:50] anyway should not be the problem [22:31:02] which variable was not updated? [22:31:06] ipaddress6 [22:32:11] ipaddress6: "2620:0:861:3::2" [22:32:13] ipaddress6_eth0: "2620:0:861:3:208:80:154:74" [22:32:27] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure: rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531#3269541 (10RobH) [22:32:30] yeah but [22:32:34] root@fermium:~# facter -p |grep ipaddress6 [22:32:34] ipaddress6 => 2620:0:861:3:208:80:154:74 [22:32:34] ipaddress6_eth0 => 2620:0:861:3:208:80:154:74 [22:33:56] mutante: you probably also need to drop the TTL on the AAAA first :) [22:34:03] i have never seen fermium.eqiad anywhere, that's weird how it got in there. 
it's always been .wikimedia.org [22:34:37] paravoid: on puppetmaster1001: [22:34:38] # grep ipaddress6 fermium.wikimedia.org.yaml [22:34:38] ipaddress6: "2620:0:861:3::2" [22:34:38] ipaddress6_eth0: "2620:0:861:3:208:80:154:74" [22:34:53] huh? [22:34:55] -rw-rw---- 1 puppet puppet 22473 May 16 21:56 fermium.wikimedia.org.yaml [22:35:01] is puppet running over there? [22:35:05] well [22:35:09] it probably hit the other puppetmaster? :) [22:35:20] no is in equiad [22:35:29] 1002 I mean [22:35:35] that's a backend [22:35:38] so? [22:36:07] AFAIK it has to hit one of the two frontends [22:36:22] (03PS1) 10Dereckson: Revert "Revert "Revert "mariadb: Depool db1055 for reimage""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354062 [22:36:38] otherwise the facts sync for the compiler makes no sense, we'll need to cycle over all of them [22:37:13] well [22:37:29] puppet did run there, and the config file did use ipaddress6 as reported on the host (the new one) [22:37:31] (03CR) 10Dereckson: [C: 032] "Previous commit ensured tin/repo parity, but tin state wasn't prod state. This ensures prod/repo parity." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354062 (owner: 10Dereckson) [22:37:44] but my theory was wrong too [22:37:47] root@puppetmaster1002:~# grep ipaddress6 /var/lib/puppet/yaml/facts/fermium.wikimedia.org.yaml ipaddress6: "2620:0:861:3::2" [22:38:02] ok [22:38:05] now in another puppet run [22:38:07] fermium 0 ~$ ls -la /var/log/puppet.log [22:38:08] -rw------- 1 root root 23789 May 16 22:14 /var/log/puppet.log [22:38:11] root@puppetmaster1002:~# grep ipaddress6 /var/lib/puppet/yaml/facts/fermium.wikimedia.org.yaml ipaddress6: "2620:0:861:3:208:80:154:74" [22:38:14] root@puppetmaster1001:~# grep ipaddress6 /var/lib/puppet/yaml/facts/fermium.wikimedia.org.yaml ipaddress6: "2620:0:861:3::2" [22:38:29] so yeah, we need to have enough puppet runs to hit all puppetmasters? 
:) [22:38:42] running again [22:38:45] (03Merged) 10jenkins-bot: Revert "Revert "Revert "mariadb: Depool db1055 for reimage""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354062 (owner: 10Dereckson) [22:38:48] and there we go [22:38:54] (03CR) 10jenkins-bot: Revert "Revert "Revert "mariadb: Depool db1055 for reimage""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354062 (owner: 10Dereckson) [22:39:15] (03PS1) 10Dzahn: lists: lower TTL for service IP change [dns] - 10https://gerrit.wikimedia.org/r/354064 [22:39:16] root@puppetmaster1001:~# grep ipaddress6 /var/lib/puppet/yaml/facts/fermium.wikimedia.org.yaml ipaddress6_eth0: "2620:0:861:3:208:80:154:74" [22:39:17] mmmh strange, I can check with joe/alex tomorrow [22:39:29] (03CR) 10jerkins-bot: [V: 04-1] lists: lower TTL for service IP change [dns] - 10https://gerrit.wikimedia.org/r/354064 (owner: 10Dzahn) [22:39:32] pretty sure that facts updates happen by the backend [22:40:13] 06Operations, 10ops-eqiad, 10Analytics: SATA errors for stat1004 in the dmesg - https://phabricator.wikimedia.org/T162770#3269545 (10Cmjohnson) @elukey we will need to coordinate a time to try and replace the sata cable and/or check settings. [22:40:29] heh, DNS lint says i cant lower it _just_ for AAAA. oh well ..."All TTLs for A and/or AAAA records at the same name should agree (using 3600)" [22:41:47] (03PS2) 10Dzahn: lists: lower TTL for service IP change [dns] - 10https://gerrit.wikimedia.org/r/354064 [22:42:40] !log Tin has now an up-to-date /srv/mediawiki-staging HEAD, with operations/mediawiki-config repo = prod = staging [22:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:00] PROBLEM - Varnish HTTP upload-backend - port 3128 on cp4021 is CRITICAL: connect to address 10.128.0.121 and port 3128: Connection refused [22:46:31] * elukey checks analytics1030 and 1040 [22:46:50] PROBLEM - puppet last run on cp4021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
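[note] The DNS lint complaint quoted above ("All TTLs for A and/or AAAA records at the same name should agree") can be sketched roughly as follows. This is only an illustration of the rule as worded in the message, not the actual lint code in the dns repo; `mismatched_ttls` and the record tuples are made up for the example.

```python
from collections import defaultdict


def mismatched_ttls(records):
    """Return {name: sorted TTLs} for names whose A/AAAA TTLs disagree.

    `records` is an iterable of (name, record_type, ttl) tuples.
    Other record types are ignored, as the rule only mentions A and AAAA.
    """
    ttls = defaultdict(set)
    for name, rtype, ttl in records:
        if rtype in ("A", "AAAA"):
            ttls[name].add(ttl)
    return {name: sorted(seen) for name, seen in ttls.items() if len(seen) > 1}


# Hypothetical zone data: lowering the TTL only on the AAAA record
# leaves the A record at a different TTL, which the rule rejects.
records = [
    ("lists", "A", 3600),
    ("lists", "AAAA", 300),
    ("other", "A", 3600),
    ("other", "AAAA", 3600),
    ("other", "MX", 600),
]

print(mismatched_ttls(records))
```

Under this reading, the way to satisfy the check is to lower the TTL on the A and AAAA records of the name together.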
[22:48:50] bblack: I'm assuming this is your load test ;) ^^^ [22:48:58] was about to ask :) [22:49:11] (03PS4) 10Dzahn: fix lists/fermium: switch v6 service IP [puppet] - 10https://gerrit.wikimedia.org/r/354055 [22:49:20] paravoid: facts updated again, in case you need them. I'll check tomorrow for the real story [22:51:50] PROBLEM - puppet last run on cp4021 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [22:52:29] cmjohnson1: o/ - should we keep analytics1030 down or is it ready? I saw it online but also your message about the raid bbu [22:56:10] PROBLEM - Hadoop DataNode on analytics1030 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [22:59:25] (acked and disabled puppet) [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170516T2300). [23:00:54] 06Operations, 10ops-eqiad, 06Analytics-Kanban: analytics1030 stuck in console while booting - https://phabricator.wikimedia.org/T162046#3269575 (10elukey) @Cmjohnson I just disabled all hadoop services and puppet on the host, from what I can read we'd need more hw maintenance right? [23:01:40] (03PS2) 10Faidon Liambotis: lists: split mailman::lists_ip variable into v4/v6 [puppet] - 10https://gerrit.wikimedia.org/r/354058 [23:03:07] * volans off to bed, cya [23:04:04] (03CR) 10Faidon Liambotis: [C: 032] "Noop according to the compiler, http://puppet-compiler.wmflabs.org/6452/fermium.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/354058 (owner: 10Faidon Liambotis) [23:04:47] ciao volans [23:05:30] ciao, did it work this time? :) [23:05:33] yup! 
[23:05:43] (PS4) Faidon Liambotis: phabricator: switch to interface::alias [puppet] - https://gerrit.wikimedia.org/r/350777
[23:05:45] (PS3) Faidon Liambotis: cassandra: switch to interface::alias [puppet] - https://gerrit.wikimedia.org/r/350775
[23:05:49] (PS4) Faidon Liambotis: gerrit: switch to interface::alias [puppet] - https://gerrit.wikimedia.org/r/350776
[23:06:03] great
[23:06:59] * elukey off too!
[23:07:05] Operations, ops-eqiad, Analytics-Kanban: analytics1030 stuck in console while booting - https://phabricator.wikimedia.org/T162046#3269591 (Cmjohnson) @elukey yes we need to replace the bbu on the raid controller
[23:07:37] elukey: just leave it down for the remainder of the week
[23:09:18] (CR) Faidon Liambotis: [C: 2] "http://puppet-compiler.wmflabs.org/6453/ -- differences as expected." [puppet] - https://gerrit.wikimedia.org/r/350777 (owner: Faidon Liambotis)
[23:10:16] root@iridium:~# ip ad ls |grep 10.64 inet 10.64.32.150/22 brd 10.64.35.255 scope global eth0 inet 10.64.32.186/21 scope global eth0
[23:10:19] er
[23:10:21] root@iridium:~# ip ad ls |grep 10.64
[23:10:23] inet 10.64.32.150/22 brd 10.64.35.255 scope global eth0
[23:10:24] inet 10.64.32.186/21 scope global eth0
[23:10:29] *sigh* *sigh*
[23:14:26] volans: yes :)
[23:15:03] wow iridium has some fancy ip magic there :)
[23:15:14] yeah, I fixed it
[23:15:24] it was a bad idea to let callers specify the netmask in most cases
[23:15:30] we have interface::alias now
[23:15:47] the only one that can be converted but I didn't convert is the authdns one, btw
[23:15:58] because it uses create_resources magic there :)
[23:16:19] (PS2) Dereckson: Replace one feed at metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/354024 (https://phabricator.wikimedia.org/T165285) (owner: Framawiki)
[23:18:15] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/354024 (https://phabricator.wikimedia.org/T165285) (owner: Framawiki)
[23:18:47] thedj: ping?
[23:19:41] (Merged) jenkins-bot: Replace one feed at metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/354024 (https://phabricator.wikimedia.org/T165285) (owner: Framawiki)
[23:19:50] (CR) jenkins-bot: Replace one feed at metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/354024 (https://phabricator.wikimedia.org/T165285) (owner: Framawiki)
[23:21:49] (CR) Faidon Liambotis: [C: 2] "Updated PCC: http://puppet-compiler.wmflabs.org/6454/ & http://puppet-compiler.wmflabs.org/6455/" [puppet] - https://gerrit.wikimedia.org/r/350775 (owner: Faidon Liambotis)
[23:25:43] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Update wmde-policy RSS feed on meta. (T165285) (duration: 00m 39s)
[23:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:25:51] T165285: Change the URL of WMDE Policy Team News RSS feed on Meta - https://phabricator.wikimedia.org/T165285
[23:26:38] (PS1) Dzahn: add service IP to be used by gerrit slave/standby [dns] - https://gerrit.wikimedia.org/r/354068
[23:27:55] (CR) Faidon Liambotis: [C: 2] add service IP to be used by gerrit slave/standby [dns] - https://gerrit.wikimedia.org/r/354068 (owner: Dzahn)
[23:31:50] RECOVERY - HP RAID on ms-be1038 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[23:33:54] (PS1) Dzahn: gerrit: codfw, use service IP for gerrit-slave, not server IP [puppet] - https://gerrit.wikimedia.org/r/354070
[23:34:53] (CR) Faidon Liambotis: [C: 2] gerrit: codfw, use service IP for gerrit-slave, not server IP [puppet] - https://gerrit.wikimedia.org/r/354070 (owner: Dzahn)
[23:35:02] (CR) Dereckson: "What do you want we do with this change?" [mediawiki-config] - https://gerrit.wikimedia.org/r/349733 (https://phabricator.wikimedia.org/T115170) (owner: TheDJ)
[23:36:46] (CR) Dereckson: "By the way, current file extensions array doesn't contain mp3:" [mediawiki-config] - https://gerrit.wikimedia.org/r/349733 (https://phabricator.wikimedia.org/T115170) (owner: TheDJ)
[23:38:09] (PS5) Faidon Liambotis: gerrit: switch to interface::alias [puppet] - https://gerrit.wikimedia.org/r/350776
[23:39:28] (CR) Faidon Liambotis: [C: 2] gerrit: switch to interface::alias [puppet] - https://gerrit.wikimedia.org/r/350776 (owner: Faidon Liambotis)
[23:48:34] (PS1) Dzahn: lists: switch v6 service IP [dns] - https://gerrit.wikimedia.org/r/354071
[23:51:01] (PS1) Dzahn: lists: raise TTL back to 1H after service IP change [dns] - https://gerrit.wikimedia.org/r/354072
[23:55:55] (CR) Dzahn: "The ticket seems to be about en-abling it and https://www.iis.fraunhofer.de/en/ff/amm/prod/audiocodec/audiocodecs/mp3.html sounds like al" [mediawiki-config] - https://gerrit.wikimedia.org/r/349733 (https://phabricator.wikimedia.org/T115170) (owner: TheDJ)
[23:56:08] (PS1) Faidon Liambotis: authdns: switch to interface::alias [puppet] - https://gerrit.wikimedia.org/r/354073
[23:56:17] bblack: https://gerrit.wikimedia.org/r/354073