[00:25:40] PROBLEM - swift-container-updater on ms-be1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:26:30] RECOVERY - swift-container-updater on ms-be1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [00:28:38] (03PS1) 10Dzahn: gerrit: dont let sshd listen on all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/354074 [00:30:49] 06Operations, 10ops-eqiad, 15User-fgiunchedi: Debug HP raid cache disabled errors on ms-be1019/20/21 - https://phabricator.wikimedia.org/T163777#3209252 (10faidon) @Cmjohnson I have heard of batteries issues from other HPE users. Could you do a visual inspection of the battery on those systems and see whethe... [00:32:56] (03PS1) 10Dzahn: gerrit: rename "server" IP to "service" IP [puppet] - 10https://gerrit.wikimedia.org/r/354075 [00:36:34] (03PS1) 10Dzahn: gerrit: switch SSHD from port 29418 to 22 [puppet] - 10https://gerrit.wikimedia.org/r/354076 [01:05:34] (03PS1) 10Dzahn: gerrit: let Apache proxy only listen on service IP [puppet] - 10https://gerrit.wikimedia.org/r/354078 [01:06:43] (03CR) 10jerkins-bot: [V: 04-1] gerrit: let Apache proxy only listen on service IP [puppet] - 10https://gerrit.wikimedia.org/r/354078 (owner: 10Dzahn) [01:06:44] (03PS1) 10Faidon Liambotis: raid/hpssacli: WARN on permanently disabled cache [puppet] - 10https://gerrit.wikimedia.org/r/354079 (https://phabricator.wikimedia.org/T163998) [01:06:47] (03PS1) 10Faidon Liambotis: raid/hpssacli: check for cable errors/no batteries [puppet] - 10https://gerrit.wikimedia.org/r/354080 (https://phabricator.wikimedia.org/T163998) [01:08:18] (03PS2) 10Dzahn: gerrit: let Apache proxy only listen on service IP [puppet] - 10https://gerrit.wikimedia.org/r/354078 [01:10:35] (03CR) 10Faidon Liambotis: [C: 04-1] "This is going to break the checkouts of… pretty much everyone. Can't we listen on both 29418 and 22? 
(if not, iptables REDIRECT is our fri" [puppet] - 10https://gerrit.wikimedia.org/r/354076 (owner: 10Dzahn) [01:13:10] (03CR) 10Faidon Liambotis: [C: 031] "Sounds fine but unrelatedly…" [puppet] - 10https://gerrit.wikimedia.org/r/354075 (owner: 10Dzahn) [01:25:36] (03PS1) 10Faidon Liambotis: cassandra/aqs: drop Hiera values equal to defaults [puppet] - 10https://gerrit.wikimedia.org/r/354081 [02:19:59] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.1) (duration: 06m 58s) [02:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:25:59] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed May 17 02:25:59 UTC 2017 (duration 6m 1s) [02:26:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:52:50] RECOVERY - HP RAID on ms-be1032 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [02:53:15] (03PS1) 10Faidon Liambotis: Rewrite the LLDP fact(s) [puppet] - 10https://gerrit.wikimedia.org/r/354084 [02:54:08] (03CR) 10jerkins-bot: [V: 04-1] Rewrite the LLDP fact(s) [puppet] - 10https://gerrit.wikimedia.org/r/354084 (owner: 10Faidon Liambotis) [02:54:12] (03CR) 10Faidon Liambotis: "Note that this is a regression (no monitoring parents) on trusty systems. 
We should probably fix this by backporting Facter 2 to trusty sy" [puppet] - 10https://gerrit.wikimedia.org/r/354084 (owner: 10Faidon Liambotis) [05:58:40] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=296.40 Read Requests/Sec=1028.50 Write Requests/Sec=2.90 KBytes Read/Sec=36389.20 KBytes_Written/Sec=1296.80 [06:01:16] !log Resume pt-table-checksum on s7.centralauth - https://phabricator.wikimedia.org/T163190 [06:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:40] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=0.50 Read Requests/Sec=0.20 Write Requests/Sec=0.50 KBytes Read/Sec=2.00 KBytes_Written/Sec=4.40 [06:14:41] (03PS1) 10Marostegui: db-codfw.php: Repool db2041 and db2049 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354085 (https://phabricator.wikimedia.org/T162611) [06:19:09] (03CR) 10Marostegui: [C: 032] db-codfw.php: Repool db2041 and db2049 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354085 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui) [06:20:08] (03Merged) 10jenkins-bot: db-codfw.php: Repool db2041 and db2049 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354085 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui) [06:20:17] (03CR) 10jenkins-bot: db-codfw.php: Repool db2041 and db2049 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354085 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui) [06:22:27] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2041 and db2049 - T162611 (duration: 00m 39s) [06:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:37] T162611: Unify revision table on s2 - https://phabricator.wikimedia.org/T162611 [06:26:05] !log Deploy alter table on s2 (revision table) db2017 (codfw master) - https://phabricator.wikimedia.org/T1626111 [06:26:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:30] !log Deploy alter table on s2 (revision table) dbstore1002 - T162611 [06:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:39] T162611: Unify revision table on s2 - https://phabricator.wikimedia.org/T162611 [06:54:28] !log Drop already renamed tables from silver (labswiki) - T164887 [06:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:37] T164887: Drop Semantic Database tables from wikitech wikis - https://phabricator.wikimedia.org/T164887 [06:56:10] !log Drop already renamed tables from labtestweb2001 (labtestwiki) - T164887 [06:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:40] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:57:50] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:58:08] ^ backups probably [06:58:11] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:58:11] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:58:11] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:58:11] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:58:11] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:58:11] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:58:12] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:58:12] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:58:12] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:58:13] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:58:13] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:58:20] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:58:20] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:58:20] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:58:21] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:58:30] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:58:30] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:58:30] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:59:00] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [06:59:01] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [06:59:01] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [06:59:01] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [06:59:01] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [06:59:01] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:59:01] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [06:59:02] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [06:59:02] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [06:59:03] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [06:59:03] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [06:59:10] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [06:59:20] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave [06:59:20] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [06:59:20] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave [06:59:20] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [06:59:20] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [06:59:20] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state 
Slave_SQL_Running: No, (no error: intentional) [06:59:31] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:59:43] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [07:04:50] PROBLEM - Apache HTTP on mw1194 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.074 second response time [07:05:50] RECOVERY - Apache HTTP on mw1194 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.111 second response time [07:13:07] marostegui: Please also run the clean duplicates maintenance script before running the schema change [07:13:11] 06Operations, 10ops-eqiad, 10netops: Interface errors on asw-c-eqiad:xe-8/0/38 - https://phabricator.wikimedia.org/T165008#3269921 (10ayounsi) @Cmjohnson Yes please, and you can put the previous optic back on asw-c-eqiad. If that still doesn't solve the issue, the cable will need to be swapped. [07:13:27] it's super super unlikely but it might have one or two duplicates in it [07:13:43] Amir1: Yeah, I am not planning to do the schema change without it. [07:13:48] since jobqueue corruption can happen all the time [07:14:32] Thanks! it's mwscript extensions/ORES/maintenance/CleanDuplicateScores.php [07:20:52] <_joe_> Amir1: what about verifying whether a score is already present at the start of the job, and bailing out if it is [07:21:00] <_joe_> as AaronSchulz suggested to you?
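The check _joe_ suggests above (look for an existing score when the job starts, and bail out if one is there) can be sketched roughly as follows. This is a minimal Python illustration, not the ORES extension's code: `run_score_job`, the in-memory `scores` dict standing in for the scores table, and `fake_score` are all hypothetical names introduced here.

```python
def run_score_job(scores, rev_id, model, compute_score):
    """Store a score for (rev_id, model) unless one is already present.

    `scores` is a plain dict standing in for the scores table (an assumption
    for this sketch). Returns True if a new score was written, False if the
    job bailed out because a score already existed.
    """
    key = (rev_id, model)
    if key in scores:
        # A previous (possibly duplicated) job already stored this score.
        return False
    scores[key] = compute_score(rev_id, model)
    return True


# Simulate the same job being delivered twice by the queue.
scores = {}
calls = []


def fake_score(rev_id, model):
    """Hypothetical scorer that records how often it was invoked."""
    calls.append((rev_id, model))
    return 0.5


first = run_score_job(scores, 123, "damaging", fake_score)
second = run_score_job(scores, 123, "damaging", fake_score)
```

In this sketch the second delivery does no work and writes nothing. Note that a check followed by a write is not atomic, so two jobs running at the same moment could still both pass the check; a guard like this would likely complement the duplicate cleanup script rather than replace it.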
[07:23:45] (03PS1) 10Muehlenhoff: Install Lua debug symbols [puppet] - 10https://gerrit.wikimedia.org/r/354089 [07:24:15] _joe_: hmm, it's not super hard to do [07:24:42] TBH I didn't actually understand it until you said it here [07:30:50] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [07:31:10] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [07:35:44] looks like it already recovered heh [07:37:50] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [07:38:05] 06Operations, 06Labs, 10hardware-requests: eqiad: (1) hardware access request for labnodepool1002 - https://phabricator.wikimedia.org/T161753#3269936 (10chasemp) [07:38:10] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [07:39:10] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [07:45:58] 06Operations, 05Prometheus-metrics-monitoring, 15User-fgiunchedi: Upgrade mysqld_exporter to 0.10.0 - https://phabricator.wikimedia.org/T161296#3269941 (10fgiunchedi) @jcrespo yeah if it works in both cases that's good enough IMO [07:55:18] (03CR) 10TheDJ: "MP3 patents have expired but WMF doesn't want this enabled unless Legal has approved it. Legal is taking it's time. I therefor want it dis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349733 (https://phabricator.wikimedia.org/T115170) (owner: 10TheDJ) [08:00:46] (03CR) 10Dereckson: "This cautious path should be followed in your extension code too. 
Again, for LAME, the project maintainers stresse on the existence of pat" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349733 (https://phabricator.wikimedia.org/T115170) (owner: 10TheDJ) [08:06:24] 06Operations, 10Traffic, 13Patch-For-Review: Merge cache_maps into cache_upload functionally - https://phabricator.wikimedia.org/T164608#3239715 (10Ottomata) I don't think that this should cause any problems on our side. I'm not aware of any maps specific jobs we run. @jallamandou let's remember to remove... [08:07:09] (03CR) 10Giuseppe Lavagetto: [C: 031] "good job" [software/cumin] - 10https://gerrit.wikimedia.org/r/352842 (https://phabricator.wikimedia.org/T164838) (owner: 10Volans) [08:14:35] (03CR) 10Volans: [C: 032] Transports: add Command class [software/cumin] - 10https://gerrit.wikimedia.org/r/352842 (https://phabricator.wikimedia.org/T164838) (owner: 10Volans) [08:15:15] (03Merged) 10jenkins-bot: Transports: add Command class [software/cumin] - 10https://gerrit.wikimedia.org/r/352842 (https://phabricator.wikimedia.org/T164838) (owner: 10Volans) [08:15:34] (03PS2) 10Volans: Transports: use Command class for commands [software/cumin] - 10https://gerrit.wikimedia.org/r/352843 (https://phabricator.wikimedia.org/T164838) [08:18:15] (03CR) 10Volans: [C: 032] Transports: use Command class for commands [software/cumin] - 10https://gerrit.wikimedia.org/r/352843 (https://phabricator.wikimedia.org/T164838) (owner: 10Volans) [08:18:45] (03Merged) 10jenkins-bot: Transports: use Command class for commands [software/cumin] - 10https://gerrit.wikimedia.org/r/352843 (https://phabricator.wikimedia.org/T164838) (owner: 10Volans) [08:19:48] (03PS2) 10Alexandros Kosiaris: Use kubetcd200{1,2,3} in the kubernetes codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/354006 [08:26:57] (03PS1) 10Jcrespo: MariaDB: Repool db2062 after maintenanace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354092 (https://phabricator.wikimedia.org/T116557) [08:27:51] (03CR) 
10Marostegui: "You might want to repool it also on the API section?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354092 (https://phabricator.wikimedia.org/T116557) (owner: 10Jcrespo) [08:28:34] (03PS2) 10Jcrespo: MariaDB: Repool db2062 after maintenanace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354092 (https://phabricator.wikimedia.org/T116557) [08:28:51] (03CR) 10Marostegui: [C: 031] MariaDB: Repool db2062 after maintenanace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354092 (https://phabricator.wikimedia.org/T116557) (owner: 10Jcrespo) [08:29:06] (03CR) 10Jcrespo: ""Mistakes were made"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354092 (https://phabricator.wikimedia.org/T116557) (owner: 10Jcrespo) [08:29:44] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569#3270027 (10elukey) I didn't read a lot of documentation about BBR but I am wondering if it could help in a local LAN use case like the Hadoop clu... [08:30:03] (03CR) 10Alexandros Kosiaris: [C: 032] Use kubetcd200{1,2,3} in the kubernetes codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/354006 (owner: 10Alexandros Kosiaris) [08:30:26] 06Operations, 10ops-eqiad, 15User-Elukey: Decommission old memcached hosts - mc1001->mc1018 - https://phabricator.wikimedia.org/T164341#3270045 (10elukey) [08:33:52] (03CR) 10TheDJ: "I disagree. 
Any remaining patent risk is about as big for MP3, as for Ogg vorbis and other free A/V codecs that we use (although for VP go" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349733 (https://phabricator.wikimedia.org/T115170) (owner: 10TheDJ) [08:35:12] (03PS1) 10Jcrespo: mariadb: Prepare db2052 for full reimage for upgrade to jessie [puppet] - 10https://gerrit.wikimedia.org/r/354094 [08:36:38] 06Operations, 05Goal, 07kubernetes: Expand the infrastructure to codfw - https://phabricator.wikimedia.org/T162041#3270053 (10akosiaris) [08:36:41] 06Operations, 10vm-requests, 05Goal, 13Patch-For-Review, 07kubernetes: Create an etcd cluster in codfw for kubernetes usage - https://phabricator.wikimedia.org/T165467#3270051 (10akosiaris) 05Open>03Resolved With https://gerrit.wikimedia.org/r/#/c/354006/ merged, the kubemasters in codfw now use the... [08:36:47] (03CR) 10TheDJ: "Oh, and the extension is just code, it's not a usage of actual codecs." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349733 (https://phabricator.wikimedia.org/T115170) (owner: 10TheDJ) [08:36:54] 06Operations, 05Goal, 07kubernetes: Expand the infrastructure to codfw - https://phabricator.wikimedia.org/T162041#3150620 (10akosiaris) [08:37:19] (03CR) 10Jcrespo: [C: 032] mariadb: Prepare db2052 for full reimage for upgrade to jessie [puppet] - 10https://gerrit.wikimedia.org/r/354094 (owner: 10Jcrespo) [08:39:36] (03PS1) 10Alexandros Kosiaris: Set Type=notify for etcd systemd units [puppet] - 10https://gerrit.wikimedia.org/r/354095 [08:39:43] (03CR) 10Jcrespo: "I will leave this for you to deploy when you are ok with it, as requested." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354092 (https://phabricator.wikimedia.org/T116557) (owner: 10Jcrespo) [08:39:58] (03CR) 10Marostegui: [C: 031] "will do!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/354092 (https://phabricator.wikimedia.org/T116557) (owner: 10Jcrespo) [08:44:56] moritzm: Logging into Quarry using OAuth is giving a 502 error again. [08:46:16] is there a pre-existing task for that? [08:46:36] I dunno…. it’s an intermittent problem. [08:47:00] I have seen it before, but usually after a while it starts working again. [08:47:03] working for me right now... [08:47:34] thedj: Were you already logged in, or did you try to actually login just now? [08:47:40] i logged in just now [08:47:50] 502 from mediawiki.org or from wmflabs ? [08:47:53] And, lol, yeah, just worked for me… :/ [08:48:52] Er, didn’t notice the URL… I would get the popup from meta, and when I hit ‘allow’ I got the 502. [08:49:34] hmm. ok.. if you see it again, take note of that :) [08:49:40] (nods) [08:49:59] Like I said, it seems to be an intermittent thing. [08:50:37] i suspect it's wmflabs, but you never know for sure.. [08:50:41] !log Deploy alter table on codfw master (db2016) and let ir replicate - T159753 [08:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:49] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753 [08:55:57] 06Operations, 10ops-eqiad, 15User-Elukey: Decommission old memcached hosts - mc1001->mc1018 - https://phabricator.wikimedia.org/T164341#3270085 (10elukey) [09:07:22] 06Operations, 13Patch-For-Review, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3270122 (10elukey) With the help of @Addshore I verified that the setting should be pi... [09:14:54] ACKNOWLEDGEMENT - Varnish HTTP upload-backend - port 3128 on cp4021 is CRITICAL: connect to address 10.128.0.121 and port 3128: Connection refused Ema cp4021 is not pooled, currently used for testing purposes. 
[09:15:40] PROBLEM - Check systemd state on cp4021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:16:43] ACKNOWLEDGEMENT - Check systemd state on cp4021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Ema cp4021 is not pooled, currently used for testing purposes. [09:25:44] (03CR) 10Filippo Giunchedi: [C: 031] raid/hpssacli: WARN on permanently disabled cache [puppet] - 10https://gerrit.wikimedia.org/r/354079 (https://phabricator.wikimedia.org/T163998) (owner: 10Faidon Liambotis) [09:25:46] (03PS1) 10Giuseppe Lavagetto: profile::calico::builder: adapt build script the new build instructions [puppet] - 10https://gerrit.wikimedia.org/r/354097 [09:28:07] (03CR) 10Filippo Giunchedi: [C: 031] raid/hpssacli: check for cable errors/no batteries [puppet] - 10https://gerrit.wikimedia.org/r/354080 (https://phabricator.wikimedia.org/T163998) (owner: 10Faidon Liambotis) [09:31:46] !log rebooting restbase2008 for update to Linux 4.9 and to pick up openjdk security updates [09:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:28] (03PS1) 10Ottomata: Add kafka_version parameter to confluent::kafka::client class [puppet] - 10https://gerrit.wikimedia.org/r/354100 [09:35:48] (03CR) 10jerkins-bot: [V: 04-1] Add kafka_version parameter to confluent::kafka::client class [puppet] - 10https://gerrit.wikimedia.org/r/354100 (owner: 10Ottomata) [09:35:58] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::calico::builder: adapt build script the new build instructions [puppet] - 10https://gerrit.wikimedia.org/r/354097 (owner: 10Giuseppe Lavagetto) [09:36:31] (03PS2) 10Ottomata: Add kafka_version parameter to confluent::kafka::client class [puppet] - 10https://gerrit.wikimedia.org/r/354100 [09:38:35] (03PS2) 10Volans: Transports: allow to specify a timeout per Command [software/cumin] - 10https://gerrit.wikimedia.org/r/352844 (https://phabricator.wikimedia.org/T164838) 
[09:43:19] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/354079 (https://phabricator.wikimedia.org/T163998) (owner: 10Faidon Liambotis) [09:44:19] 06Operations, 10Monitoring, 13Patch-For-Review: check_hpssacli should report on battery failures and cache disabled - https://phabricator.wikimedia.org/T163998#3270227 (10Volans) @faidon let me know if you want the Icinga RAID handler to open tasks also for warnings, these includes the above and the predicti... [09:45:13] heya _joe_, got a q about profile hiera params if you have a sec [09:45:17] " 1. Profile classes should only have parameters that default to an explicit hiera calls with no fallback value. [09:45:17] " [09:45:27] what if you want to make a profile with a lot of parameters, but have sane defaults? [09:45:46] like, you want to override it in special cases, like in labs maybe, but you usually don't need to [09:54:20] PROBLEM - cassandra-a CQL 10.192.48.54:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.54 and port 9042: Connection refused [09:54:30] PROBLEM - cassandra-a SSL 10.192.48.54:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [09:54:53] ema: hola! question for ya if you have a minute [09:55:05] (03CR) 10Filippo Giunchedi: "Is this ready to be reviewed? I see this (or a specific PS) cherry picked in beta" [puppet] - 10https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: 10MarkTraceur) [09:55:50] PROBLEM - cassandra-b SSL 10.192.48.55:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [09:55:50] PROBLEM - cassandra-b CQL 10.192.48.55:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.55 and port 9042: Connection refused [09:56:02] ^ fixing downtime for restbase2009, that's in order [09:56:20] nuria_: ciao! 
shoot :) [09:56:43] (03CR) 10Filippo Giunchedi: "This is failing to rebase on beta ATM" [puppet] - 10https://gerrit.wikimedia.org/r/336840 (owner: 10Hashar) [09:58:32] ema: we have a bizarre spike in unique devices data from November 14th (when we dropped the varnish3 code) to February 6th (when global last access was deployed). best seen here: https://goo.gl/nI2gKQ [09:58:58] !log rebooting restbase2009 for update to Linux 4.9 and to pick up openjdk security updates [09:58:59] ema: i think the root cause for it is this code: https://github.com/wikimedia/puppet/commit/0edbbabdfd76b95bdde6f26fdb4062fad6af6b69 [09:59:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:15] ema: that somehow affects cookie expiration times when deployed [09:59:59] ema: as if "all of a sudden" our cookies are expiring earlier than they should [10:01:17] nuria_: interesting [10:01:44] that commit should have been essentially a no-op given that all servers were already running v4 for a while [10:02:08] ema: i do not understand how this came to be but what i see in our data is that the totals do not change but the proportion of people w/o (last access) cookies [10:03:11] ema: is higher than it should be for this interval, and, once you deploy the global last access it goes back to "normal" (the proportion of users w/o cookies goes back to what it was before Nov 14th) [10:03:11] thanks mor itzm [10:04:02] ema: ya, i do not understand it either, maybe the "date" is just a coincidence [10:04:10] RECOVERY - cassandra-a SSL 10.192.48.54:7001 on restbase2009 is OK: SSL OK - Certificate restbase2009-a valid until 2017-09-12 15:36:07 +0000 (expires in 118 days) [10:05:11] nuria_: this was the gerrit changeset https://gerrit.wikimedia.org/r/#/c/322252/, among the comments I've wisely included a link to the puppet compiler output for that change showing all VCL changes https://puppet-compiler.wmflabs.org/4612/ but the results are gone now [10:05:36] they would have
been pretty useful to see the implications of that commit :) [10:05:41] ema: ahhhh wait, looking at dates, we start seeing this on the 9th [10:05:53] nuria_: aha! [10:06:02] the change was merged on Nov 21st [10:06:03] ema: ohhh [10:06:14] ema: and on November 2nd-9th [10:06:26] ema: was that the start of the Varnish 4 rollout? [10:06:33] let's see [10:06:48] ema: i read the SAL before asking, just so you know [10:07:03] nuria_: https://gerrit.wikimedia.org/r/#/q/topic:varnish4-upgrade+(status:open+OR+status:merged) [10:08:52] ema: so it is accurate to say that Varnish 4 started rolling out on Nov 2nd (+/- 1 day) and there was another major milestone in the rollout on November 9th [10:09:06] definitely [10:09:20] (03CR) 10Alexandros Kosiaris: [C: 04-1] Rewrite the LLDP fact(s) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/354084 (owner: 10Faidon Liambotis) [10:09:23] ema: ok, bingo [10:09:33] ema: now.. ahem we know the what but not the why [10:10:14] :) [10:10:27] the cookie-handling code did change quite a lot so perhaps something went wrong there [10:10:59] ema: could it have made the expiration much shorter? [10:11:13] ema: that is what the data seems to be saying [10:13:18] I wouldn't exclude that, yes!
Let me create a phab so that we can look into this properly [10:18:00] ema: i did [10:18:23] ema: https://phabricator.wikimedia.org/T165560 [10:18:47] oh thanks [10:19:05] 06Operations, 06Analytics-Kanban, 10Traffic: Artificial spike in offset of unique devices from November 14th to February 6th on wikidata - https://phabricator.wikimedia.org/T165560#3270402 (10ema) [10:20:11] 06Operations, 06Analytics-Kanban, 10Traffic: Artificial spike in offset of unique devices from November to February 6th on wikidata - https://phabricator.wikimedia.org/T165560#3270406 (10ema) [10:21:19] ema: updating now [10:23:53] 06Operations, 06Analytics-Kanban, 10Traffic: Artificial spike in offset of unique devices from November to February 6th on wikidata - https://phabricator.wikimedia.org/T165560#3270435 (10Nuria) Summing up from IRC's conversation between @nuria and @ema: From the 2nd of November we start seeing a shift of th... [10:24:49] 06Operations, 06Analytics-Kanban, 10Traffic: Artificial spike in offset of unique devices from November to February 6th on wikidata - https://phabricator.wikimedia.org/T165560#3270438 (10Nuria) {F8109003} This is the offset data for wikimedia mobile [10:26:41] !log rebooting restbase2010 for update to Linux 4.9 and to pick up openjdk security updates [10:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:32] moritz- what dependency did you add for your package so that it installs r131-2~wmf1 or future versions? [10:29:51] given that lz4 changed its numbering version? [10:30:22] any best tip so it doesn't break in the future? 
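Whether a `>=` dependency written against the old r-numbered lz4 packages still holds once 1.7.x is packaged comes down to Debian's version ordering, where `~` sorts before everything (even the end of the string) and digit runs compare numerically. Below is a simplified Python sketch of that ordering as described in Debian Policy; it is not dpkg's implementation. The helper names are hypothetical, `1.7.5-1` is a made-up example version, and it is an assumption that the r-numbered packages carry a `0.0~rNNN` upstream version (for example `0.0~r131-2~wmf1`).

```python
import re


def _char_order(c):
    """Sort key for one character of a non-digit run."""
    if c == "~":
        return -1        # tilde sorts before everything, even end of string
    if c.isalpha():
        return ord(c)    # letters sort before other symbols
    return ord(c) + 256  # '.', '+', '-' and similar sort after letters


def _compare_fragment(a, b):
    """Compare two upstream-version or revision strings; return -1, 0 or 1."""
    while a or b:
        # Step 1: compare the leading non-digit runs character by character.
        text_a = re.match(r"\D*", a).group()
        text_b = re.match(r"\D*", b).group()
        for i in range(max(len(text_a), len(text_b))):
            order_a = _char_order(text_a[i]) if i < len(text_a) else 0
            order_b = _char_order(text_b[i]) if i < len(text_b) else 0
            if order_a != order_b:
                return -1 if order_a < order_b else 1
        a, b = a[len(text_a):], b[len(text_b):]
        # Step 2: compare the leading digit runs numerically (empty counts as 0).
        num_a = re.match(r"\d*", a).group()
        num_b = re.match(r"\d*", b).group()
        if int(num_a or 0) != int(num_b or 0):
            return -1 if int(num_a or 0) < int(num_b or 0) else 1
        a, b = a[len(num_a):], b[len(num_b):]
    return 0


def _split(version):
    """Split '[epoch:]upstream[-revision]' into (epoch, upstream, revision)."""
    epoch, sep, rest = version.partition(":")
    if not sep:
        epoch, rest = "0", version
    upstream, sep, revision = rest.rpartition("-")
    if not sep:
        upstream, revision = rest, ""
    return int(epoch), upstream, revision


def compare_versions(a, b):
    """Return -1, 0 or 1 as version a sorts before, equal to, or after b."""
    epoch_a, upstream_a, revision_a = _split(a)
    epoch_b, upstream_b, revision_b = _split(b)
    if epoch_a != epoch_b:
        return -1 if epoch_a < epoch_b else 1
    return (_compare_fragment(upstream_a, upstream_b)
            or _compare_fragment(revision_a, revision_b))


# Would a hypothetical 1.7.5-1 package satisfy ">= 0.0~r130-1"?
satisfied = compare_versions("1.7.5-1", "0.0~r130-1") >= 0
```

Under these assumptions, any `1.x` upstream version sorts after every `0.0~rNNN` one, so a lower bound on an r-numbered version would keep matching after the numbering change. On a Debian system, `dpkg --compare-versions` is the authoritative way to check a specific pair.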
[10:31:11] jynus: HHVM 3.18 uses liblz4-dev (>= 0.0~r130-1) [10:31:15] 06Operations, 06Analytics-Kanban, 10Traffic: Artificial spike in offset of unique devices from November to February 6th on wikidata - https://phabricator.wikimedia.org/T165560#3270516 (10Nuria) [10:31:47] but for future updates it's hard to tell, not sure if lz4 even attempts to ensure some kind of soname rules [10:31:53] and I assume that would work if 1.7.X was packaged? [10:31:56] ema: super thanks, updated ticket, we think this also affects the global access cookie [10:31:56] he he [10:32:04] that was my problem [10:32:29] apparently they have been using v1.7.X versioning since November [10:32:29] nuria_: ok, I'll look into it! [10:32:37] https://github.com/lz4/lz4/releases [10:33:02] I will for now just copy you :-) [10:46:47] 06Operations, 06Operations-Software-Development: Puppet compiler: sync facts from all workers - https://phabricator.wikimedia.org/T165583#3270557 (10Volans) [10:47:44] !log stopping db2052 and preparing it for reimage [10:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:57] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Gerrit can't listen on port 22. It's not a root owned process (thankfully) and <1024 ports are reserved for root owned processes." [puppet] - 10https://gerrit.wikimedia.org/r/354076 (owner: 10Dzahn) [10:53:19] (03CR) 10Alexandros Kosiaris: [C: 04-1] gerrit: let Apache proxy only listen on service IP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/354078 (owner: 10Dzahn) [10:54:26] 06Operations, 10Pybal, 10Traffic, 10netops: Deploy pybal with BGP MED support (for primary/backup) in production - https://phabricator.wikimedia.org/T165584#3270574 (10mark) [10:54:57] Hi, I'm wondering how to fix this:
[10:54:57] W: GPG error: http://apt.wikimedia.org/wikimedia stretch-wikimedia InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 9D392D3FFADF18FB [10:57:00] paladox: is not the one documented in https://wikitech.wikimedia.org/wiki/APT_repository ? [10:57:29] Not sure [10:57:46] (03CR) 10Alexandros Kosiaris: [C: 04-1] gerrit: dont let sshd listen on all interfaces (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/354074 (owner: 10Dzahn) [10:58:11] i've done the wget -O wikimedia-apt-key "https://wikitech.wikimedia.org/w/index.php?title=APT_repository/Key&action=raw" step [10:58:19] and apt-key add wikimedia-apt-key [10:58:22] !log rebooting restbase2011 for update to Linux 4.9 and to pick up openjdk security updates [10:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:53] moritzm: is stretch-wikimedia using a different GPG key? [11:00:19] it uses a newer one, yes [11:01:16] Ah, how would i get the new key please? [11:01:26] should be in puppet [11:02:44] 06Operations, 10Monitoring, 13Patch-For-Review: check_hpssacli should report on battery failures and cache disabled - https://phabricator.wikimedia.org/T163998#3270600 (10faidon) It probably shouldn't; these issues are rare enough and complex enough that it's probably better if we handle them manually for no... 
[11:03:23] thanks, i will go searching in puppet [11:08:39] (03CR) 10Volans: [C: 032] Transports: allow to specify a timeout per Command [software/cumin] - 10https://gerrit.wikimedia.org/r/352844 (https://phabricator.wikimedia.org/T164838) (owner: 10Volans) [11:09:01] 06Operations, 10Scap: Upload the scap package to stretch-wikimedia - https://phabricator.wikimedia.org/T165586#3270633 (10Paladox) [11:09:20] (03Merged) 10jenkins-bot: Transports: allow to specify a timeout per Command [software/cumin] - 10https://gerrit.wikimedia.org/r/352844 (https://phabricator.wikimedia.org/T164838) (owner: 10Volans) [11:11:10] (03PS1) 10Volans: Puppet compiler: automatically sync from all masters [puppet] - 10https://gerrit.wikimedia.org/r/354105 (https://phabricator.wikimedia.org/T165583) [11:14:32] (03PS1) 10Faidon Liambotis: Use str2bool() for the is_virtual fact [puppet] - 10https://gerrit.wikimedia.org/r/354106 [11:15:59] (03PS1) 10Giuseppe Lavagetto: role::aqs: use profile::cassandra [puppet] - 10https://gerrit.wikimedia.org/r/354107 [11:17:18] !log rebooting restbase2012 for update to Linux 4.9 and to pick up openjdk security updates [11:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:09] (03PS2) 10Faidon Liambotis: Rewrite the LLDP fact(s) [puppet] - 10https://gerrit.wikimedia.org/r/354084 [11:18:11] (03PS1) 10Faidon Liambotis: Do not confine LLDP fact to physical/non-VMs [puppet] - 10https://gerrit.wikimedia.org/r/354108 [11:18:54] (03CR) 10Alexandros Kosiaris: [C: 032] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/354045 (owner: 10Paladox) [11:18:59] (03PS3) 10Alexandros Kosiaris: redis: Remove support for precise [puppet] - 10https://gerrit.wikimedia.org/r/354045 (owner: 10Paladox) [11:19:02] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] redis: Remove support for precise [puppet] - 10https://gerrit.wikimedia.org/r/354045 (owner: 10Paladox) [11:19:08] Thanks ^^ [11:20:22] (03PS7) 10Paladox: HHVM: Fix puppet on trusty [puppet] - 10https://gerrit.wikimedia.org/r/353964 (https://phabricator.wikimedia.org/T165462) [11:21:14] (03CR) 10Alexandros Kosiaris: [C: 04-1] Rewrite the LLDP fact(s) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/354084 (owner: 10Faidon Liambotis) [11:21:41] (03CR) 10Alexandros Kosiaris: [C: 031] Rewrite the LLDP fact(s) [puppet] - 10https://gerrit.wikimedia.org/r/354084 (owner: 10Faidon Liambotis) [11:22:11] (03CR) 10Alexandros Kosiaris: [C: 031] Do not confine LLDP fact to physical/non-VMs [puppet] - 10https://gerrit.wikimedia.org/r/354108 (owner: 10Faidon Liambotis) [11:22:44] (03CR) 10Faidon Liambotis: Rewrite the LLDP fact(s) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/354084 (owner: 10Faidon Liambotis) [11:23:22] (03CR) 10Faidon Liambotis: [C: 032] "PCC says no diff: http://puppet-compiler.wmflabs.org/6458/" [puppet] - 10https://gerrit.wikimedia.org/r/354106 (owner: 10Faidon Liambotis) [11:23:41] (03PS2) 10Faidon Liambotis: Use str2bool() for the is_virtual fact [puppet] - 10https://gerrit.wikimedia.org/r/354106 [11:27:56] !log uploaded php-luasandbox_2.0.12~jessie3 to apt.wikimedia.org (adds a separate debug package hhvm-luasandbox-dbg) [11:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:10] 06Operations, 10media-storage, 13Patch-For-Review, 15User-fgiunchedi: Implement storage policies for swift - https://phabricator.wikimedia.org/T151648#3270743 (10fgiunchedi) I've cherry picked https://gerrit.wikimedia.org/r/353878 in beta and tested on 
the swift 2.10 cluster there and added a corresponding... [12:03:12] (03PS2) 10Muehlenhoff: Install Lua debug symbols [puppet] - 10https://gerrit.wikimedia.org/r/354089 [12:08:43] (03CR) 10Muehlenhoff: [C: 032] Install Lua debug symbols [puppet] - 10https://gerrit.wikimedia.org/r/354089 (owner: 10Muehlenhoff) [12:22:01] 06Operations, 06Analytics-Kanban, 10Traffic: Artificial spike in offset of unique devices from November to February 6th on wikidata - https://phabricator.wikimedia.org/T165560#3270816 (10Nuria) @ema: has the way we compute nocookies flag on X-analytics changed? It should take into account "all" cookies not... [12:35:06] (03PS1) 10Brian Wolff: Harden zerowiki config (no raw html, no transclude NS_ZERO) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354113 (https://phabricator.wikimedia.org/T162771) [12:42:49] !log Deploy alter table on s2.trwiki directly on codfw master (db2017) after running the clean up duplicates script - T164530 [12:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:57] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530 [12:48:30] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569#3270916 (10BBlack) There's not a lot of good data on how BBR behaves in datacenter-like networks (high bandwidth, low latency, low loss, etc). I... [12:49:12] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569#3270917 (10BBlack) Also while I'm thinking about it - we should validate that the sysctl setting for fq as default qdisc "sticks" on reboot and i... 
[12:57:22] 06Operations, 10Analytics, 10Analytics-Cluster, 10Traffic: Enable Kafka TLS and secure the kafka traffic with it - https://phabricator.wikimedia.org/T121561#3270926 (10Ottomata) [12:57:30] 06Operations, 10Analytics, 10Analytics-Cluster, 10Traffic: Enable Kafka TLS and secure the kafka traffic with it - https://phabricator.wikimedia.org/T121561#1881737 (10Ottomata) [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170517T1300). [13:00:04] bawolff: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:15] * bawolff waves [13:05:13] 06Operations, 06Performance-Team, 10Thumbor, 05MW-1.30-release-notes (WMF-deploy-2017-05-23_(1.30.0-wmf.2)), 13Patch-For-Review: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#3270944 (10Gilles) TIFF support for orient... [13:07:40] PROBLEM - puppet last run on cp4021 is CRITICAL: Return code of 255 is out of bounds [13:10:50] RECOVERY - puppet last run on cp4021 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:10:56] Hello [13:11:59] "I believe all uses of raw html have been removed at this point." < any idea to test that? 
[13:13:40] I use insource:html in the search box [13:14:01] which only returns archived pages [13:14:13] and i manually tested this site [13:14:27] (03CR) 10Volans: [C: 031] "LGTM although given the change in preferred_lft to the authdns lo alias and their close timing in puppet runs, I'd suggest to disable pupp" [puppet] - 10https://gerrit.wikimedia.org/r/354073 (owner: 10Faidon Liambotis) [13:14:30] ok [13:14:34] Also, about a month ago I previously broke raw html on that wiki, so any raw html that is left won't work [13:14:38] (03CR) 10Dereckson: [C: 032] Harden zerowiki config (no raw html, no transclude NS_ZERO) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354113 (https://phabricator.wikimedia.org/T162771) (owner: 10Brian Wolff) [13:16:14] (03Merged) 10jenkins-bot: Harden zerowiki config (no raw html, no transclude NS_ZERO) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354113 (https://phabricator.wikimedia.org/T162771) (owner: 10Brian Wolff) [13:16:22] (03CR) 10jenkins-bot: Harden zerowiki config (no raw html, no transclude NS_ZERO) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354113 (https://phabricator.wikimedia.org/T162771) (owner: 10Brian Wolff) [13:16:29] oh the associated bug is private, i should make that public (there's nothing secret on it) [13:18:46] 06Operations, 06Performance-Team, 10Thumbor, 05MW-1.30-release-notes (WMF-deploy-2017-05-23_(1.30.0-wmf.2)), 13Patch-For-Review: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#3270972 (10Gilles) I've checked and PdfHan... [13:19:06] bawolff: live on mwdebug1002.eqiad.wmnet [13:19:14] ok, testing [13:19:28] 06Operations, 07HHVM: HHVM 3.18 crash on job runner / luasandbox - https://phabricator.wikimedia.org/T165043#3270973 (10MoritzMuehlenhoff) The crashes are not related/fallout of the segfault fixed in T162586. 
I checked our servers which have been upgraded to the version and we've seen three crashes related to... [13:19:58] 06Operations, 07HHVM: HHVM 3.18 crashes in luasandbox - https://phabricator.wikimedia.org/T165043#3270974 (10MoritzMuehlenhoff) [13:20:07] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569#2696835 (10ayounsi) >>! In T147569#3270027, @elukey wrote: > where LibreNMS periodically notifies us that the switch ports are saturated due to s... [13:22:25] Dereckson: ok, I tested through the normal user workflows, and everything seems to work :) [13:22:31] So it should be good to make it cluster wide [13:23:12] syncing [13:23:49] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Harden zerowiki config (T162771) (duration: 00m 41s) [13:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:58] T162771: Zerowiki is broken by filtering - https://phabricator.wikimedia.org/T162771 [13:26:23] hiho [13:26:31] Dereckson: Thank you :) [13:26:45] I have 1 or 2 things to add to swat! [13:26:55] but I can do them if you would like Dereckson ! [13:28:25] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569#3270985 (10elukey) Thanks for the feedback, I thought it was more a problem of port capacity completely used (100%) and buffers filled, now it ma... [13:30:20] PROBLEM - puppet last run on cp3039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:31:18] Dereckson: okay with you? I believe the first bit of swat is done> [13:31:19] ? 
[13:32:00] 06Operations, 10Traffic, 10netops: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006 - https://phabricator.wikimedia.org/T150256#2779434 (10ema) Current situation: |host | port | switch | port | redundancy issues | |lvs1007 | eth0 | asw2-a5-eqiad | xe-0/0/8.0 | lvs1010 eth1 also on asw2-a5... [13:33:33] addshore: okay, you can deploy them [13:33:57] sorry I was preparing an USB keys with my tickets to print for hackathon travel [13:34:09] RECOVERY - Hadoop NodeManager on analytics1030 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [13:34:19] RECOVERY - Hadoop DataNode on analytics1030 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [13:34:42] (03PS1) 10Addshore: Take RevisionSlider out of beta on all sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354116 (https://phabricator.wikimedia.org/T163685) [13:35:25] 06Operations, 10Traffic, 10netops: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006 - https://phabricator.wikimedia.org/T150256#3271021 (10BBlack) Also notable: lvs1009 and lvs1012 connections to row B (eth2) are using 1GbE ports rather than 10GbE? [13:35:36] ack! [13:36:25] (03PS2) 10Addshore: Take RevisionSlider out of beta on all sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354116 (https://phabricator.wikimedia.org/T163685) [13:36:32] wmmmmm [13:36:54] (03CR) 10Addshore: [C: 032] Take RevisionSlider out of beta on all sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354116 (https://phabricator.wikimedia.org/T163685) (owner: 10Addshore) [13:37:28] ohia elukey :P [13:37:36] addshore: o/ [13:38:30] why the hadoop daemons started again on 1030? 
grrr [13:38:36] RECOVERY - NTP on analytics1030 is OK: NTP OK: Offset -0.001350015402 secs [13:39:00] (03Merged) 10jenkins-bot: Take RevisionSlider out of beta on all sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354116 (https://phabricator.wikimedia.org/T163685) (owner: 10Addshore) [13:39:08] * Reedy hands elukey a `sudo kill -9` stick [13:39:08] (03CR) 10jenkins-bot: Take RevisionSlider out of beta on all sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354116 (https://phabricator.wikimedia.org/T163685) (owner: 10Addshore) [13:40:47] * elukey thanks Reedy for the kind offer [13:41:01] !log upgrading mw1261-mw1265 to new hhvm-luasandbox/hhvm-luasandbox-dbg packages [13:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:08] 06Operations, 10ops-eqiad: analytics1030 failed bbu - https://phabricator.wikimedia.org/T165529#3271040 (10Cmjohnson) Create Service Request: Service Tag 9BJKV12 Confirmed: Request 948416019 was successfully submitted. Your service request has been successfully created and will be reviewed by our team. A Del... [13:42:33] !log shutdown analytics1030 for T165529 [13:42:35] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: SWAT [[gerrit:354116|Take RevisionSlider out of beta on all sites]] T163685 PT 1/2 (duration: 00m 40s) [13:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:40] T165529: analytics1030 failed bbu - https://phabricator.wikimedia.org/T165529 [13:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:50] T163685: Take RevisionSlider out of beta for Wikis - https://phabricator.wikimedia.org/T163685 [13:43:11] @elukey: sorry ...i thought this was taken down for maintenance.... 
[13:44:09] !log addshore@tin Synchronized wmf-config/InitialiseSettings-labs.php: SWAT [[gerrit:354116|Take RevisionSlider out of beta on all sites]] NOOP PT 2/2 (duration: 00m 39s) [13:44:10] cmjohnson1: no worries! I disabled all the daemons and puppet yesterday but for some reason they started again :( [13:44:16] so I just shut it down [13:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:01] !log addshore@tin Synchronized php-1.30.0-wmf.1/extensions/TwoColConflict/modules/: SWAT [[gerrit:354009|Fix issues with column alignment]] T165129 (duration: 00m 39s) [13:47:06] !log replacing optics on cr1-3/1/2 and/or asw-c-eqiad:xe-8/0/38 T165008 [13:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:08] T165129: Help text of editor column covered by WikiEditor - https://phabricator.wikimedia.org/T165129 [13:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:16] T165008: Interface errors on asw-c-eqiad:xe-8/0/38 - https://phabricator.wikimedia.org/T165008 [13:48:07] swat done [13:48:14] 06Operations, 10Icinga, 10Monitoring: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3225440 (10akosiaris) I 've had a look at this. I can't reliably reproduce anything yet. Things I 've already completely ruled out * Service restarts (service icin... [13:49:00] 06Operations, 10Analytics, 10Analytics-Cluster, 10Traffic: Enable Kafka TLS and secure the kafka traffic with it - https://phabricator.wikimedia.org/T121561#3271107 (10Ottomata) It looks like a lot of the hard work for this has been done for Cassandra over in T108953 and T111113. Documentation for this is... 
[13:55:04] (03CR) 10Mark Bergsma: Add unit tests for DNSQueryMonitoringProtocol (031 comment) [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/343655 (owner: 10Ema) [13:57:00] 06Operations, 10Scap: Upload the scap package to stretch-wikimedia - https://phabricator.wikimedia.org/T165586#3270633 (10fgiunchedi) @paladox can you try again? the package should be available now [13:57:53] RECOVERY - puppet last run on cp3039 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [13:58:26] 06Operations, 10Scap: Upload the scap package to stretch-wikimedia - https://phabricator.wikimedia.org/T165586#3271166 (10Paladox) 05Open>03Resolved a:03fgiunchedi @fgiunchedi thanks. It installed now :) root@phab-tin:/home/paladox# scap usage: scap [-h] ... scap: error: too few arguments [14:17:32] !log demon@tin Pruned MediaWiki: 1.29.0-wmf.19 [keeping static files] (duration: 00m 12s) [14:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:33] !log demon@tin Pruned MediaWiki: 1.29.0-wmf.19 (duration: 01m 07s) [14:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:01] !log upgrading mw1170-mw1179 to new hhvm-luasandbox/hhvm-luasandbox-dbg packages [14:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:56] !log demon@tin Pruned MediaWiki: 1.29.0-wmf.21 [keeping static files] (duration: 00m 22s) [14:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:39] !log demon@tin Synchronized README: No-op, forcing co-master sync (duration: 00m 40s) [14:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:59] (03PS3) 10Ema: Add unit tests for DNSQueryMonitoringProtocol [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/343655 [14:31:13] (03CR) 10Ema: Add unit tests for DNSQueryMonitoringProtocol (031 comment) [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/343655 (owner: 
10Ema) [14:34:19] 06Operations, 10ops-eqiad, 10netops: Interface errors on asw-c-eqiad:xe-8/0/38 - https://phabricator.wikimedia.org/T165008#3271277 (10Cmjohnson) @ayounsi Changing the optics on cr1 appears to have worked..no new errors. Please review and resolve if you're satisfied. cmjohnson@asw-c-eqiad> show interfaces x... [14:43:24] (03CR) 10Mark Bergsma: [C: 032] Add unit tests for DNSQueryMonitoringProtocol [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/343655 (owner: 10Ema) [14:43:57] (03Merged) 10jenkins-bot: Add unit tests for DNSQueryMonitoringProtocol [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/343655 (owner: 10Ema) [14:46:22] (03CR) 10Faidon Liambotis: "It will still be against "lo". The preferred_lft change is not going to happen at runtime, but it's an interesting change indeed. This def" [puppet] - 10https://gerrit.wikimedia.org/r/354073 (owner: 10Faidon Liambotis) [14:50:03] !log Deploy alter table on s2.revision table on db1069 - T162611 [14:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:12] T162611: Unify revision table on s2 - https://phabricator.wikimedia.org/T162611 [14:57:26] (03PS2) 10Giuseppe Lavagetto: role::aqs: use profile::cassandra [puppet] - 10https://gerrit.wikimedia.org/r/354107 [14:58:01] <_joe_> ema: I would expect you would do https://gerrit.wikimedia.org/r/343655 against master and then cherry-pick it on 1.13 [15:05:45] (03PS3) 10Giuseppe Lavagetto: role::aqs: use profile::cassandra [puppet] - 10https://gerrit.wikimedia.org/r/354107 [15:08:22] !log upgrading mw1189-mw1199 to new hhvm-luasandbox/hhvm-luasandbox-dbg packages [15:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:02] (03PS4) 10Giuseppe Lavagetto: role::aqs: use profile::cassandra [puppet] - 10https://gerrit.wikimedia.org/r/354107 [15:35:30] _joe_: oh, I thought we'd put new changes into 1.13 and then merge them back into master once we start working on a new release 
[15:35:38] clearly something to discuss in vienna and agree upon :) [15:35:42] <_joe_> yeah [15:42:39] (03PS5) 10Giuseppe Lavagetto: role::aqs: use profile::cassandra [puppet] - 10https://gerrit.wikimedia.org/r/354107 [15:45:35] (03PS6) 10Mforns: [WIP] First prototype of the EventLogging purge script [puppet] - 10https://gerrit.wikimedia.org/r/353265 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey) [15:45:52] (03CR) 10Filippo Giunchedi: "cherry-picked in beta and seems to work, though puppet currently fails on deployment-imagescaler01 due to https://gerrit.wikimedia.org/r/#" [puppet] - 10https://gerrit.wikimedia.org/r/342811 (https://phabricator.wikimedia.org/T151065) (owner: 10Gilles) [15:46:33] (03CR) 10jerkins-bot: [V: 04-1] [WIP] First prototype of the EventLogging purge script [puppet] - 10https://gerrit.wikimedia.org/r/353265 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey) [15:48:05] (03CR) 10Giuseppe Lavagetto: [C: 031] "https://puppet-compiler.wmflabs.org/6468/ this would add quite a few new things to the cassandra in aqs; I'd argue it's a good thing, but " [puppet] - 10https://gerrit.wikimedia.org/r/354107 (owner: 10Giuseppe Lavagetto) [15:52:09] (03CR) 10Giuseppe Lavagetto: graphite::alerts: add alerting on session loss (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/350555 (owner: 10Giuseppe Lavagetto) [15:52:28] (03PS6) 10Giuseppe Lavagetto: graphite::alerts: add alerting on session loss [puppet] - 10https://gerrit.wikimedia.org/r/350555 [15:53:58] (03CR) 10Giuseppe Lavagetto: [C: 032] graphite::alerts: add alerting on session loss [puppet] - 10https://gerrit.wikimedia.org/r/350555 (owner: 10Giuseppe Lavagetto) [16:04:13] 06Operations, 10Pybal, 10Traffic, 10netops: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3271522 (10elukey) Ticket closed as won't fix. 
The main issue is that not all the clients are sending the close notify, and nginx follows what the majority of the browser... [16:21:37] 06Operations, 10Pybal, 10Traffic, 10netops: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3271570 (10BBlack) I wonder if Chrome (which is the dominant browser now, not MSIE as indicated in that nginx source comment) sends the close notify? [16:25:06] RECOVERY - Check systemd state on cp4021 is OK: OK - running: The system is fully operational [16:25:26] RECOVERY - Varnish HTTP upload-backend - port 3128 on cp4021 is OK: HTTP OK: HTTP/1.1 200 OK - 177 bytes in 0.147 second response time [16:26:19] (03PS1) 10BBlack: cp4021: test bbr settings through reboot [puppet] - 10https://gerrit.wikimedia.org/r/354124 [16:26:30] (03CR) 10BBlack: [V: 032 C: 032] cp4021: test bbr settings through reboot [puppet] - 10https://gerrit.wikimedia.org/r/354124 (owner: 10BBlack) [16:36:19] 06Operations, 10netops: LLDP on cache hosts - https://phabricator.wikimedia.org/T165614#3271620 (10ema) [16:36:43] 06Operations, 10netops: LLDP on cache hosts - https://phabricator.wikimedia.org/T165614#3271633 (10ema) p:05Triage>03Normal [16:38:05] 06Operations, 10Traffic, 10netops: LLDP on cache hosts - https://phabricator.wikimedia.org/T165614#3271620 (10ema) [16:42:35] !log upgrading mw2120-mw2129 to Linux 4.9 and HHVM 3.18 [16:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:27] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569#3271654 (10BBlack) On the reboot issue: I've tested cp4021 and the existing puppetization works fine on reboot (even given the other stuff below)... 
[16:49:56] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 2081235 [16:51:23] 06Operations, 10Traffic, 10netops: LLDP on cache hosts - https://phabricator.wikimedia.org/T165614#3271620 (10BBlack) So, a couple points: 1. Probably the reason for a lack of neighbors is that some (most?) of the switches don't blanket-enable LLDP for all ports. They explicitly list certain groups like `i... [16:52:21] (03PS3) 10Jforrester: Enable wgCiteResponsiveReferences on ilowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351459 (https://phabricator.wikimedia.org/T164230) (owner: 10Framawiki) [16:52:23] (03PS1) 10Jforrester: Enable wgCiteResponsiveReferences on mswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354127 (https://phabricator.wikimedia.org/T165247) [16:52:33] !log restarting varnish backend on cp1099 (mailbox) [16:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:35] (03Abandoned) 10Dzahn: gerrit: switch SSHD from port 29418 to 22 [puppet] - 10https://gerrit.wikimedia.org/r/354076 (owner: 10Dzahn) [16:59:56] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 0 [17:00:10] (03PS2) 10Volans: Transports: allow to specify exit codes per Command [software/cumin] - 10https://gerrit.wikimedia.org/r/352845 (https://phabricator.wikimedia.org/T164838) [17:01:14] 06Operations: Audit / document reasons for not enabling HT? - https://phabricator.wikimedia.org/T165618#3271722 (10BBlack) [17:03:58] 06Operations: Audit / document reasons for not enabling HT? - https://phabricator.wikimedia.org/T165618#3271722 (10jcrespo) Many of those are virtual machines. [17:04:26] 06Operations: Audit / document reasons for not enabling HT? - https://phabricator.wikimedia.org/T165618#3271722 (10MoritzMuehlenhoff) Quite a number of those Ganeti hosts [17:04:49] I beat you by 1 minute! 
[17:07:14] :-) [17:07:50] ah, there must be a facter way to filter those [17:07:52] or something [17:08:46] then on some services, there are clear proofs that enabling it is both better and worse [17:08:52] systemd-detect-virt for jessie at least [17:09:02] mysql fails on stretch now. [17:09:10] ah, and we have a puppet fact "virtual" [17:09:26] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:09:56] but independently of that, I wonder why it varies: there are both very old and very new hosts there [17:09:57] yeah [17:10:13] even from the same batch and installed at the same time [17:10:16] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] [17:10:16] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.074 second response time [17:10:35] the inconsistency worries me more than the actual effects [17:10:35] re-auditing and skipping virtuals, will amend task :) [17:11:32] I think in most cases these are just inconsistent firmware states/config settings, e.g. for restbase there's already a task to straighten out the remaining two stragglers: https://phabricator.wikimedia.org/T162735 [17:12:44] 06Operations: Audit / document reasons for not enabling HT? - https://phabricator.wikimedia.org/T165618#3271766 (10BBlack) [17:12:57] 06Operations: Audit / document reasons for not enabling HT? - https://phabricator.wikimedia.org/T165618#3271722 (10BBlack) Edited the top part, re-ran excluding virtuals. [17:13:16] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] [17:15:51] some of them may be those faulty hosts that were dropping cpu speed, where we ended up blacklisting a module [17:16:05] I think at some point we probably disabled HT while debugging or something, in those cases [17:19:31] 06Operations: Audit / document reasons for not enabling HT? 
- https://phabricator.wikimedia.org/T165618#3271782 (10jcrespo) There is some merit to disabling HT in mysql- because we used to buy hosts with too many cores, and those were not really used, there is some merit to some claims where that would be not go... [17:20:53] another thing is, how reliable is the check? [17:21:31] and yes, I am seeing it now, but the question is how reliable is puppet for that? [17:23:56] (03Draft1) 10Paladox: mysql: Fix installing package on stretch [puppet] - 10https://gerrit.wikimedia.org/r/354131 [17:23:58] (03PS2) 10Paladox: mysql: Fix installing package on stretch [puppet] - 10https://gerrit.wikimedia.org/r/354131 [17:25:24] (03PS3) 10Paladox: mysql: Fix installing package on stretch [puppet] - 10https://gerrit.wikimedia.org/r/354131 [17:25:46] (03PS4) 10Paladox: mysql: Fix installing package on stretch [puppet] - 10https://gerrit.wikimedia.org/r/354131 [17:28:02] 06Operations: Audit / document reasons for not enabling HT? - https://phabricator.wikimedia.org/T165618#3271807 (10BBlack) I think that almost universally, HT is a win for the host as a whole. There's always more things going on than there are cpu cores. If nothing else, picture it in your head as "puppet agen... [17:30:33] 06Operations, 10Traffic, 10netops: LLDP on cache hosts - https://phabricator.wikimedia.org/T165614#3271620 (10faidon) It's worse than that I'm afraid :( LLDP regularly crashes on some of our older switches (running ancient JunOS): ``` # asw2-a5-eqiad> show system core-dumps # fpc0: # ------------------... [17:40:22] 06Operations: Audit / document reasons for not enabling HT? - https://phabricator.wikimedia.org/T165618#3271846 (10jcrespo) Alternative `$(lscpu | grep "Thread(s) per core:" | sed -e "s/Thread(s) per core: //")` check (sorry, I had to do it) throws the same results: ``` (150) auth2001.codfw.wmnet,bast[1001,... 
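The `lscpu` check quoted in the 17:40:22 comment above reads the "Thread(s) per core" line to tell whether HT is on. A minimal Python sketch of the same idea follows. It is not the command used in the task: the function names and the sample text are hypothetical, and it assumes `lscpu` prints a line of the form `Thread(s) per core: N`.

```python
import re

# Hypothetical sample of lscpu-style output, used only for illustration.
SAMPLE_LSCPU = """Architecture:          x86_64
CPU(s):                32
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
"""


def threads_per_core(lscpu_output: str) -> int:
    """Return the integer on the 'Thread(s) per core' line of lscpu output."""
    match = re.search(
        r"^Thread\(s\) per core:\s*(\d+)\s*$", lscpu_output, re.MULTILINE
    )
    if match is None:
        raise ValueError("no 'Thread(s) per core' line found")
    return int(match.group(1))


def hyperthreading_enabled(lscpu_output: str) -> bool:
    """True when more than one hardware thread is reported per core."""
    return threads_per_core(lscpu_output) > 1
```

A value of 1 only says that one thread per core is active. It does not distinguish HT being disabled in firmware from a CPU that lacks it, so results would still need the kind of manual audit discussed in the task.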
[17:40:51] 06Operations, 10Traffic, 10netops: LLDP on cache hosts - https://phabricator.wikimedia.org/T165614#3271850 (10BBlack) Answering for @ema I think this mostly came up as a consequence of trying to map out the data in T150256#3271004 using lldpcli to confirm port connections. That led to an in-depth conversati... [17:56:20] (03CR) 10Paladox: [C: 031] "This fixes the puppet run now :)" [puppet] - 10https://gerrit.wikimedia.org/r/354131 (owner: 10Paladox) [17:58:17] 06Operations, 10Gerrit: Upload gerrit package to stretch apt.wm.org repo - https://phabricator.wikimedia.org/T165620#3271896 (10Paladox) [18:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170517T1800). [18:00:05] James_F: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:10] * James_F waves. [18:04:53] 06Operations, 07Zuul: Add a stretch debian package for zuul - https://phabricator.wikimedia.org/T165621#3271912 (10Paladox) [18:05:29] (03CR) 10Jcrespo: "This package is basically an import of https://forge.puppet.com/puppetlabs/mysql/changelog , probably a much better idea would be to upgra" [puppet] - 10https://gerrit.wikimedia.org/r/354131 (owner: 10Paladox) [18:06:49] (03CR) 10Paladox: [C: 031] "> This package is basically an import of https://forge.puppet.com/puppetlabs/mysql/changelog" [puppet] - 10https://gerrit.wikimedia.org/r/354131 (owner: 10Paladox) [18:08:26] !log reprepro include facter 2.4.6 to jessie-wikimedia/trusty-wikimedia [18:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:14] Hello. 
[18:15:33] (CR) Dereckson: [C: 2] Enable wgCiteResponsiveReferences on ilowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/351459 (https://phabricator.wikimedia.org/T164230) (owner: Framawiki)
[18:16:20] Thanks, Dereckson. You can do the two configs together.
[18:16:36] (Merged) jenkins-bot: Enable wgCiteResponsiveReferences on ilowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/351459 (https://phabricator.wikimedia.org/T164230) (owner: Framawiki)
[18:19:11] (CR) Dereckson: [C: 2] Enable wgCiteResponsiveReferences on mswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/354127 (https://phabricator.wikimedia.org/T165247) (owner: Jforrester)
[18:20:07] (Merged) jenkins-bot: Enable wgCiteResponsiveReferences on mswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/354127 (https://phabricator.wikimedia.org/T165247) (owner: Jforrester)
[18:21:15] James_F: live on mwdebug1002.eqiad.wmnet
[18:21:20] (both)
[18:21:24] Ta.
[18:21:28] (CR) jenkins-bot: Enable wgCiteResponsiveReferences on ilowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/351459 (https://phabricator.wikimedia.org/T164230) (owner: Framawiki)
[18:21:30] (CR) jenkins-bot: Enable wgCiteResponsiveReferences on mswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/354127 (https://phabricator.wikimedia.org/T165247) (owner: Jforrester)
[18:21:50] Something at least is fast today (the post-merge config task to pickup changes to beta)
[18:22:33] Ha, yeah.
[18:23:12] Dereckson: Yeah, both LGTM.
[18:23:28] okay, let's sync
[18:24:04] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Enable wgCiteResponsiveReferences on ilo. and ms.wikipedia (T164230, T165247) (duration: 00m 39s)
[18:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:24:14] T165247: Convert reference lists over to `responsive` on mswiki - https://phabricator.wikimedia.org/T165247
[18:24:14] T164230: Convert reference lists over to `responsive` on ilowiki - https://phabricator.wikimedia.org/T164230
[18:25:12] https://integration.wikimedia.org/ci/job/mediawiki-extensions-php55-trusty/3572/console tests are still running for the VE Change
[18:25:56] Yup, they take an age.
[18:26:05] (CR) Jcrespo: "I am going to create a task, we have been talking and we are not 100% sure this class is going to survive on stretch and may be deprecated" [puppet] - https://gerrit.wikimedia.org/r/354131 (owner: Paladox)
[18:26:48] Back in the good old days when the gerrit VE submodule update job didn't work, I'd pre-merge the wmf.X item and manually make the MW-core submodule update, which was a pain but had the advantage of making this tedious wait something that didn't hold up the SWAT.
[18:26:52] Progress! ;-)
[18:27:06] (CR) Paladox: [C: 1] "> I am going to create a task, we have been talking and we are not" [puppet] - https://gerrit.wikimedia.org/r/354131 (owner: Paladox)
[18:28:00] James_F: live on mwdebug1002
[18:29:42] Dereckson: Yup, works a treat.
[18:30:14] ack'ed
[18:30:50] !log dereckson@tin Synchronized php-1.30.0-wmf.1/extensions/VisualEditor/modules/ve-mw/init/targets/ve.init.mw.DesktopArticleTarget.init.js: Do not check for visual editor availability when loading source editor ([[Gerrit:354126]]) (duration: 00m 39s)
[18:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:31:05] (PS5) Paladox: mysql: Fix installing package on stretch [puppet] - https://gerrit.wikimedia.org/r/354131
[18:31:53] Dereckson: Thanks so much. :-)
[18:32:06] You're welcome.
[18:36:43] (PS8) Dzahn: phabricator: convert to profile/role-structure [puppet] - https://gerrit.wikimedia.org/r/353600
[18:38:12] (CR) Paladox: [C: 1] phabricator: convert to profile/role-structure [puppet] - https://gerrit.wikimedia.org/r/353600 (owner: Dzahn)
[18:39:21] Operations, Community-Wikimetrics, DBA, Icinga, and 2 others: Evaluate future of wmf puppet module "mysql" - https://phabricator.wikimedia.org/T165625#3272009 (jcrespo)
[18:41:04] (CR) Jcrespo: "Check T165625 and maybe we can reach a conclusion to move forward." [puppet] - https://gerrit.wikimedia.org/r/354131 (owner: Paladox)
[18:41:12] thanks ^^
[18:42:19] Operations, Community-Wikimetrics, DBA, Icinga, and 2 others: Evaluate future of wmf puppet module "mysql" - https://phabricator.wikimedia.org/T165625#3272026 (Paladox) I vote to migrate to use the mariadb class as it is maintained :)
[18:44:36] Operations, Community-Wikimetrics, DBA, Icinga, and 2 others: Evaluate future of wmf puppet module "mysql" - https://phabricator.wikimedia.org/T165625#3272032 (jcrespo) Paladox- if this is for your own usage- you can do that now, what I cannot guarantee is that it will fullfill your needs easily...
[18:45:47] Operations, Community-Wikimetrics, DBA, Icinga, and 2 others: Evaluate future of wmf puppet module "mysql" - https://phabricator.wikimedia.org/T165625#3272033 (Paladox) Oh, i think the puppet role I'm using right now (deployment_server) is using mysql. Since somehow it is using the mysql class.
[18:46:03] have you noticed how all the puppet roles are "role-role-something" .. we should not repeat the role word in the name
[18:46:46] (PS1) Aaron Schulz: Set cron script to dump MediaWiki DB lag times into statsd [puppet] - https://gerrit.wikimedia.org/r/354138 (https://phabricator.wikimedia.org/T149210)
[18:47:03] (CR) jerkins-bot: [V: -1] Set cron script to dump MediaWiki DB lag times into statsd [puppet] - https://gerrit.wikimedia.org/r/354138 (https://phabricator.wikimedia.org/T149210) (owner: Aaron Schulz)
[18:52:39] (PS9) Dzahn: phabricator: convert to profile/role-structure [puppet] - https://gerrit.wikimedia.org/r/353600
[18:54:34] !log T164865: restarting RESTBase in dev env to apply range-delete probability bug-fix
[18:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:54:42] T164865: Prototype and test range delete-based current revision storage - https://phabricator.wikimedia.org/T164865
[18:55:33] (PS10) Paladox: phabricator: convert to profile/role-structure [puppet] - https://gerrit.wikimedia.org/r/353600 (owner: Dzahn)
[18:56:10] (PS11) Dzahn: phabricator: convert to profile/role-structure [puppet] - https://gerrit.wikimedia.org/r/353600
[18:56:54] (PS12) Dzahn: phabricator: convert to profile/role-structure [puppet] - https://gerrit.wikimedia.org/r/353600
[18:58:15] (CR) Dzahn: [C: 2] "only diff is the motd and in labs it has already been running like this for about a week http://puppet-compiler.wmflabs.org/6473/iridium." [puppet] - https://gerrit.wikimedia.org/r/353600 (owner: Dzahn)
[18:58:59] paladox: no "role-role" for once, heh http://puppet-compiler.wmflabs.org/6473/iridium.eqiad.wmnet/
[18:59:08] lol yeh
[18:59:12] it created a new file for me
[18:59:19] when i re applied your latest change
[18:59:30] Notice: /Stage[main]/Role::Phabricator_server/System::Role[phabricator_server]/Motd::Script[role-phabricator_server]/File[/etc/update-motd.d/05-role-phabricator-server]/ensure: created
[18:59:42] paladox: yea, that's the message you see when you login
[18:59:49] that's the only change
[18:59:54] oh
[19:00:14] system::role does that and adds it to motd (message of the day)
[19:00:26] so if the role name changes, the banner changes
[19:02:03] paladox: let me know if you see any unexpected shinken alerts .. but afaict all the phab instances already use the new role name
[19:02:12] Ok
[19:02:13] yep
[19:02:14] right, you changed them last week
[19:02:24] Yep
[19:02:27] cool
[19:02:44] on prod it's all noop
[19:03:13] Yep :
[19:03:17] :)
[19:03:49] puppet passes again
[19:04:01] except the thing you mentioned. drop the "role-role" snippet.
[19:04:07] ok great, and when you login it says now "iridium is a Phabricator (Main) (role::phabricator::main)
[19:04:21] bbiaw then
[19:04:31] yep
[19:04:32] phabricator is a Phabricator (Main) Server (phabricator_server)
[19:04:55] mutante which class did you apply to iridium?
[19:05:09] role::phabricator::main sounds like the old class
[19:05:14] 1241 node /^(iridium\.eqiad|phab1001\.eqiad|phab2001\.codfw)\.wmnet$/ {
[19:05:17] 1242 role(phabricator_server)
[19:05:48] thanks, yep :)
[19:05:51] paladox: yes, it does... i think it's an unrelated bug
[19:05:56] about updating the motd
[19:05:57] Ok
[19:05:59] that we noticed before
[19:06:03] oh
[19:06:15] but that config snippet is now correct as opposed to before
[19:06:24] regarding the role-role part
[19:06:29] ok
[19:06:31] :)
[19:17:23] (PS2) Aaron Schulz: Set cron script to dump MediaWiki DB lag times into statsd [puppet] - https://gerrit.wikimedia.org/r/354138 (https://phabricator.wikimedia.org/T149210)
[19:21:09] !log mr1-ulsfo replacement underway
[19:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:24:18] Operations: puppet mechanism updating motd is broken - https://phabricator.wikimedia.org/T80998#3272113 (Dzahn)
[19:24:30] Operations, ops-ulsfo, DC-Ops, netops: mr1-ulsfo crashed - https://phabricator.wikimedia.org/T164970#3272118 (RobH)
[19:25:56] Operations: puppet mechanism updating motd is broken - https://phabricator.wikimedia.org/T80998#882211 (Dzahn) also T45954. not sure yet, but it seems like this bug might be back
[19:26:06] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[19:26:36] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[19:26:56] PROBLEM - Host mr1-ulsfo is DOWN: CRITICAL - Network Unreachable (198.35.26.194)
[19:27:16] PROBLEM - Host asw-ulsfo is DOWN: PING CRITICAL - Packet loss = 100%
[19:27:17] Operations: fix "packages can be updated"-message in motd - https://phabricator.wikimedia.org/T81837#3272125 (Dzahn)
[19:28:16] PROBLEM - Host mr1-ulsfo IPv6 is DOWN: CRITICAL - Destination Unreachable (2620:0:863:ffff::6)
[19:29:46] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[19:29:46] PROBLEM - Host mr1-ulsfo.oob is DOWN: PING CRITICAL - Packet loss = 100%
[19:29:56] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[19:30:02] ^ maintenance at ULSFO
[19:33:07] (PS2) Dzahn: debug_proxy: move 'standard' and 'base::firewall' to role [puppet] - https://gerrit.wikimedia.org/r/353361
[19:33:56] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[19:34:06] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[19:35:36] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[19:35:43] (CR) Dzahn: [C: 2] "no-op http://puppet-compiler.wmflabs.org/6474/" [puppet] - https://gerrit.wikimedia.org/r/353361 (owner: Dzahn)
[19:36:46] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[19:36:55] uh
[19:37:22] how is mr1 causing 5xx?
[19:37:37] maybe just coincidence?
[19:37:49] oh there we go: 19:27 < icinga-wm> PROBLEM - Host asw-ulsfo is DOWN: PING CRITICAL - Packet loss = 100%
[19:39:38] (PS4) Dzahn: mariadb: clean up duplicate GRANTs for phstats user [puppet] - https://gerrit.wikimedia.org/r/348779
[19:42:58] ACKNOWLEDGEMENT - HP RAID on db2058 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12, Controller, Battery/Capacitor - Failed: 1I:1:3 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T165629
[19:43:02] Operations, ops-codfw: Degraded RAID on db2058 - https://phabricator.wikimedia.org/T165629#3272144 (ops-monitoring-bot)
[19:43:46] (PS2) Dzahn: graphite: move 'standard' and 'base::firewall' to role [puppet] - https://gerrit.wikimedia.org/r/353364
[19:44:10] (CR) Dzahn: [C: 1] "no-op http://puppet-compiler.wmflabs.org/6475/" [puppet] - https://gerrit.wikimedia.org/r/353364 (owner: Dzahn)
[19:46:16] PROBLEM - puppet last run on ms-fe1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:55:56] PROBLEM - swift-account-server on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:55:56] PROBLEM - swift-container-replicator on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:55:56] PROBLEM - swift-object-updater on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:56:06] PROBLEM - swift-container-auditor on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:56:46] RECOVERY - swift-account-server on ms-be1019 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[19:56:46] RECOVERY - swift-container-replicator on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[19:56:47] RECOVERY - swift-object-updater on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[19:56:56] RECOVERY - swift-container-auditor on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[20:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170517T2000). Please do the needful.
[20:08:44] Operations, Gerrit: move gerrit.wm.org SSH service to private/behind LVS like phab-vcs - https://phabricator.wikimedia.org/T165631#3272220 (Dzahn)
[20:11:23] Operations, Gerrit: move gerrit.wm.org SSH service to private/behind LVS like phab-vcs - https://phabricator.wikimedia.org/T165631#3272234 (Dzahn) unrelatedly: yea, would also be nice to reinstall cobalt as gerrit1001, wouldn't it? subtask?
[20:14:16] RECOVERY - puppet last run on ms-fe1008 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[20:19:01] Operations, Continuous-Integration-Infrastructure, Patch-For-Review, Release-Engineering-Team (Backlog): Secondary production Jenkins for CI - https://phabricator.wikimedia.org/T150771#3272251 (Dzahn) update: the current status is: ``` 230 # CI master / CI standby (switch in Hiera) 231 node...
[20:22:52] no ores for today
[20:26:36] PROBLEM - swift-container-updater on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:27:26] RECOVERY - swift-container-updater on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[21:10:56] RECOVERY - Host mr1-ulsfo.oob is UP: PING OK - Packet loss = 0%, RTA = 56.46 ms
[21:17:46] RECOVERY - Host asw-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 74.49 ms
[21:17:46] RECOVERY - Host mr1-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 74.24 ms
[21:21:56] RECOVERY - Host mr1-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 78.83 ms
[21:42:31] (CR) Dzahn: [C: -1] "http://puppet-compiler.wmflabs.org/6476/" [puppet] - https://gerrit.wikimedia.org/r/354075 (owner: Dzahn)
[21:49:36] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:00:11] !log T164865: altering compaction strategy to sizetiered, local_group_wikipedia_T_parsoid_html.data (in RESTBase dev)
[22:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:00:18] T164865: Prototype and test range delete-based current revision storage - https://phabricator.wikimedia.org/T164865
[22:17:36] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[22:19:09] (PS1) Dzahn: fix all the "role-role" in system::roles [puppet] - https://gerrit.wikimedia.org/r/354172
[23:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170517T2300). Please do the needful.