[00:20:47] !log catrope@naos Synchronized php-1.29.0-wmf.21/extensions/Flow/modules/flow/ui/widgets/mw.flow.ui.ReplyWidget.js: T163749 (duration: 01m 24s)
[00:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:20:57] T163749: [wmf.21-regression] Posting a reply to Flow topics with multiple posts redirects users to no-js mode - https://phabricator.wikimedia.org/T163749
[00:46:03] (CR) Jforrester: "To SWAT this week, after the train on Thursday." [mediawiki-config] - https://gerrit.wikimedia.org/r/349146 (https://phabricator.wikimedia.org/T162760) (owner: Catrope)
[00:50:58] (CR) Jforrester: Enable RCFilters beta feature on all remaining wikis (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/347045 (https://phabricator.wikimedia.org/T144458) (owner: Catrope)
[00:51:07] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 276 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[00:53:15] So, who would be able to delete some old repositories? FR-tech has at least a dozen we no longer need!
[00:53:50] Hmmm Gerrit is written in Java--isn't that supposed to include garbage collection?
[00:53:52] jk
[00:53:57] haha
[00:54:26] * AndyRussG smirks following his snark ambush
[00:54:39] ;p
[00:55:47] AndyRussG: we turned that off.. it slowed Gerrit down :p
[00:56:05] (not even kidding)
[00:56:07] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 15 probes of 276 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[00:56:14] heheh aww
[00:56:34] should've used C
[00:57:17] ejegg: i think the usual practice is you just make one last commit deleting all the things
[00:57:36] if you think it should really be gone
[04:03:26] (PS1) BryanDavis: Always cleanup manifest on stop [software/tools-webservice] - https://gerrit.wikimedia.org/r/350362 (https://phabricator.wikimedia.org/T163355)
[04:08:35] (PS2) BryanDavis: Always cleanup manifest on stop [software/tools-webservice] - https://gerrit.wikimedia.org/r/350362 (https://phabricator.wikimedia.org/T163355)
[04:10:02] (PS3) BryanDavis: Always cleanup manifest on stop [software/tools-webservice] - https://gerrit.wikimedia.org/r/350362 (https://phabricator.wikimedia.org/T163355)
[04:29:51] (PS1) Dzahn: lift IP throttle for event at high school in Jesi,Italy [mediawiki-config] - https://gerrit.wikimedia.org/r/350363 (https://phabricator.wikimedia.org/T163870)
[04:32:47] (PS2) Dzahn: lift IP throttle for event at high school in Jesi,Italy [mediawiki-config] - https://gerrit.wikimedia.org/r/350363 (https://phabricator.wikimedia.org/T163870)
[04:32:50] (CR) TerraCodes: "Should commons be added?" [mediawiki-config] - https://gerrit.wikimedia.org/r/350363 (https://phabricator.wikimedia.org/T163870) (owner: Dzahn)
[04:34:41] (PS3) Dzahn: lift IP throttle for event at high school in Jesi,Italy [mediawiki-config] - https://gerrit.wikimedia.org/r/350363 (https://phabricator.wikimedia.org/T163870)
[04:35:57] (PS4) Dzahn: lift IP throttle for event at high school in Jesi,Italy [mediawiki-config] - https://gerrit.wikimedia.org/r/350363 (https://phabricator.wikimedia.org/T163870)
[04:39:05] (CR) TerraCodes: [C: 1] lift IP throttle for event at high school in Jesi,Italy [mediawiki-config] - https://gerrit.wikimedia.org/r/350363 (https://phabricator.wikimedia.org/T163870) (owner: Dzahn)
[04:41:00] (PS5) Dzahn: lift IP throttle for event at high school in Jesi,Italy [mediawiki-config] - https://gerrit.wikimedia.org/r/350363 (https://phabricator.wikimedia.org/T163870)
[04:46:05] (Abandoned) KartikMistry: Remove redundant setting from cxsave wgRateLimits [mediawiki-config] - https://gerrit.wikimedia.org/r/349201 (owner: KartikMistry)
[04:52:43] (CR) TerraCodes: [C: 1] lift IP throttle for event at high school in Jesi,Italy [mediawiki-config] - https://gerrit.wikimedia.org/r/350363 (https://phabricator.wikimedia.org/T163870) (owner: Dzahn)
[04:53:27] (PS1) BryanDavis: Sort backend provided types when displaying help [software/tools-webservice] - https://gerrit.wikimedia.org/r/350364
[05:02:47] (CR) BryanDavis: [C: 1] Tools: Fix test for enabled PHP module mcrypt [puppet] - https://gerrit.wikimedia.org/r/340059 (https://phabricator.wikimedia.org/T159022) (owner: Tim Landscheidt)
[05:07:45] (CR) BryanDavis: [C: 1] Tools: Require gridengine-master for gridengine_resource [puppet] - https://gerrit.wikimedia.org/r/339921 (https://phabricator.wikimedia.org/T127388) (owner: Tim Landscheidt)
[05:10:17] PROBLEM - nova-compute process on labvirt1003 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/nova-compute
[05:11:17] RECOVERY - nova-compute process on labvirt1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute
[05:28:47] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:28:57] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:28:57] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:28:57] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:28:57] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:28:57] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:28:58] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:28:58] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:28:59] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:28:59] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:29:17] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:29:17] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:29:17] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:29:18] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:29:18] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:29:18] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:29:18] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:29:27] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:30:37] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[05:30:47] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[05:30:47] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:30:47] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[05:30:47] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:30:47] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:30:48] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave
[05:30:48] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:30:49] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:30:57] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave
[05:31:07] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[05:31:07] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:31:07] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:31:07] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[05:31:07] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[05:31:08] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:31:08] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[05:31:17] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[05:57:23] !log Deploy alter table enwiki.revision on labsdb1011 - T132416
[05:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:57:33] T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416
[06:02:17] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=2754.70 Read Requests/Sec=831.80 Write Requests/Sec=268.50 KBytes Read/Sec=27474.40 KBytes_Written/Sec=3784.40
[06:05:17] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=2566.80 Read Requests/Sec=5319.10 Write Requests/Sec=35.00 KBytes Read/Sec=22305.20 KBytes_Written/Sec=6530.40
[06:10:54] !log Deploy alter table on s3, on db1075 (eqiad master) for tables: change_tag and tag_summary - T147166
[06:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:01] T147166: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166
[06:12:02] (CR) EddieGP: [C: 1] lift IP throttle for event at high school in Jesi,Italy [mediawiki-config] - https://gerrit.wikimedia.org/r/350363 (https://phabricator.wikimedia.org/T163870) (owner: Dzahn)
[06:14:17] PROBLEM - Host analytics1060 is DOWN: PING CRITICAL - Packet loss = 100%
[06:14:18] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=29.60 Read Requests/Sec=0.00 Write Requests/Sec=0.90 KBytes Read/Sec=0.00 KBytes_Written/Sec=13.20
[06:15:37] RECOVERY - Host analytics1060 is UP: PING OK - Packet loss = 0%, RTA = 36.26 ms
[06:20:18] (PS1) Giuseppe Lavagetto: Reset TTL on etcd RO client record, lower it on RW ones [dns] - https://gerrit.wikimedia.org/r/350365 (https://phabricator.wikimedia.org/T159687)
[06:20:21] (PS1) Giuseppe Lavagetto: Switch etcd records to codfw [dns] - https://gerrit.wikimedia.org/r/350366 (https://phabricator.wikimedia.org/T159687)
[06:20:22] (PS1) Giuseppe Lavagetto: Restore TTL for RW etcd records [dns] - https://gerrit.wikimedia.org/r/350367 (https://phabricator.wikimedia.org/T159687)
[06:22:08] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error Table bawiktionary.tag_summary doesnt exist on query. Default database: bawiktionary. [Query snipped]
[06:22:14] ^ expected
[06:22:16] Will fix it
[06:24:38] (CR) Giuseppe Lavagetto: [C: 2] Reset TTL on etcd RO client record, lower it on RW ones [dns] - https://gerrit.wikimedia.org/r/350365 (https://phabricator.wikimedia.org/T159687) (owner: Giuseppe Lavagetto)
[06:25:07] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:28:05] (PS1) Giuseppe Lavagetto: role::configcluster: stop replicating to codfw for etcd [puppet] - https://gerrit.wikimedia.org/r/350368 (https://phabricator.wikimedia.org/T159687)
[06:33:15] (CR) Giuseppe Lavagetto: [C: 2] Switch etcd records to codfw [dns] - https://gerrit.wikimedia.org/r/350366 (https://phabricator.wikimedia.org/T159687) (owner: Giuseppe Lavagetto)
[06:35:59] Operations, Patch-For-Review, User-Joe: etcd switchover/enhancements - https://phabricator.wikimedia.org/T159687#3212993 (Joe)
[06:37:56] (CR) Giuseppe Lavagetto: [C: 2] role::configcluster: stop replicating to codfw for etcd [puppet] - https://gerrit.wikimedia.org/r/350368 (https://phabricator.wikimedia.org/T159687) (owner: Giuseppe Lavagetto)
[06:45:27] !log Deploy alter table on s2, on db1054 (eqiad master) for tables: change_tag and tag_summary - https://phabricator.wikimedia.org/T147166
[06:45:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:27] PROBLEM - Etcd replication lag on conf2002 is CRITICAL: connect to address 10.192.32.141 and port 8000: Connection refused
[06:48:32] (PS2) Muehlenhoff: admins: quiddity in WMF group but missing in LDAP user list [puppet] - https://gerrit.wikimedia.org/r/350328 (owner: Dzahn)
[06:48:37] PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed
[06:49:42] <_joe_> the etcd issue is expected
[06:50:03] <_joe_> as puppet just ran there stopping etcdmirror but it didn't run on tegmen still
[06:50:36] (PS1) Marostegui: db-eqiad.php: Repool db1071 [mediawiki-config] - https://gerrit.wikimedia.org/r/350369 (https://phabricator.wikimedia.org/T163548)
[06:50:37] PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:51:25] (CR) Muehlenhoff: [C: 2] admins: quiddity in WMF group but missing in LDAP user list [puppet] - https://gerrit.wikimedia.org/r/350328 (owner: Dzahn)
[06:52:12] (CR) Marostegui: [C: 2] db-eqiad.php: Repool db1071 [mediawiki-config] - https://gerrit.wikimedia.org/r/350369 (https://phabricator.wikimedia.org/T163548) (owner: Marostegui)
[06:53:24] (Merged) jenkins-bot: db-eqiad.php: Repool db1071 [mediawiki-config] - https://gerrit.wikimedia.org/r/350369 (https://phabricator.wikimedia.org/T163548) (owner: Marostegui)
[06:53:35] Operations, Deployment-Systems, Patch-For-Review, Scap (Scap3-MediaWiki-MVP): Install conftool on deployment masters - https://phabricator.wikimedia.org/T163565#3213002 (Joe) I don't think scap should interact with conftool by itself, unless it reproduces what the `restart-` scripts are...
[06:53:37] (CR) jenkins-bot: db-eqiad.php: Repool db1071 [mediawiki-config] - https://gerrit.wikimedia.org/r/350369 (https://phabricator.wikimedia.org/T163548) (owner: Marostegui)
[06:55:31] RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational
[06:56:08] !log marostegui@naos Synchronized wmf-config/db-eqiad.php: Repool db1071 - T162539 T163548 (duration: 02m 24s)
[06:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:19] T162539: Deploy schema change for adding term_full_entity_id column to wb_terms table - https://phabricator.wikimedia.org/T162539
[06:56:19] T163548: Drop the useless wb_terms keys "wb_terms_entity_type" and "wb_terms_type" on "wb_terms" table - https://phabricator.wikimedia.org/T163548
[06:57:08] (CR) Giuseppe Lavagetto: [C: 2] Restore TTL for RW etcd records [dns] - https://gerrit.wikimedia.org/r/350367 (https://phabricator.wikimedia.org/T159687) (owner: Giuseppe Lavagetto)
[06:58:59] Operations, Patch-For-Review, User-Joe: etcd switchover/enhancements - https://phabricator.wikimedia.org/T159687#3213018 (Joe)
[07:01:02] Operations, Patch-For-Review, User-Joe: etcd switchover/enhancements - https://phabricator.wikimedia.org/T159687#3213020 (Joe) All clients have been successfully switched to codfw, and replication has been stopped; I tested depooling and pooling back a client (to test again that nginx-based auth work...
[07:03:41] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error Table bawiktionary.watchlist doesnt exist on query. Default database: bawiktionary. [Query snipped]
[07:04:41] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[07:06:51] PROBLEM - MariaDB Slave Lag: s2 on db1018 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.94 seconds
[07:07:48] ^ that is expected
[07:08:16] <_joe_> manuel, stop toying with databases :P
[07:08:44] they are happy they are getting some wikilove!
[07:09:31] sexual harassment panda, may have different view
[07:09:50] !log Deploy alter table on s6, on db1061 (eqiad master) for tables: change_tag and tag_summary - https://phabricator.wikimedia.org/T147166
[07:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:05] They are happy because they will be healthier once we switchback to eqiad!
[07:10:08] I am doing it for them!
[07:10:51] RECOVERY - MariaDB Slave Lag: s2 on db1018 is OK: OK slave_sql_lag Replication lag: 0.02 seconds
[07:11:05] see? ^
[07:11:06] :)
[07:24:47] !log Deploy alter table on s4, on db1068 (eqiad master) for tables: change_tag and tag_summary - https://phabricator.wikimedia.org/T147166
[07:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:40] !log Deploy alter table on s7, on db1062 (eqiad master) for tables: change_tag and tag_summary - https://phabricator.wikimedia.org/T147166
[07:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:55] (CR) Giuseppe Lavagetto: [C: -1] "There is a change that slipped into this and shouldn't have. apart from that, LGTM" (2 comments) [switchdc] - https://gerrit.wikimedia.org/r/349731 (https://phabricator.wikimedia.org/T163367) (owner: Volans)
[07:32:00] (CR) Giuseppe Lavagetto: "I'm not sure this makes it better to me, (DONE vs 2/2 has really no difference), but LGTM" [switchdc] - https://gerrit.wikimedia.org/r/349732 (https://phabricator.wikimedia.org/T163371) (owner: Volans)
[07:37:01] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:37:01] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:37:01] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:37:01] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:37:01] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:37:02] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:37:02] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:37:03] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:37:03] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:37:15] ^ backups I guess
[07:37:21] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:37:21] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:37:21] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:37:21] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:37:21] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:37:22] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:37:31] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:38:22] probably the backups back with overload
[07:39:40] yes, most likely
[07:39:41] (PS1) Marostegui: db-eqiad.php: Depool hosts that need to be moved [mediawiki-config] - https://gerrit.wikimedia.org/r/350372 (https://phabricator.wikimedia.org/T162681)
[07:43:20] (CR) Jcrespo: [C: 1] db-eqiad.php: Depool hosts that need to be moved [mediawiki-config] - https://gerrit.wikimedia.org/r/350372 (https://phabricator.wikimedia.org/T162681) (owner: Marostegui)
[07:44:01] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[07:44:11] PROBLEM - MariaDB Slave Lag: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[07:48:58] !log Deploy alter table on s1, on db1052 (eqiad master) for tables: change_tag and tag_summary - https://phabricator.wikimedia.org/T147166
[07:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:50] (CR) Alexandros Kosiaris: [C: 1] Replace $::main_ipaddress by the new ipaddress fact [puppet] - https://gerrit.wikimedia.org/r/345569 (https://phabricator.wikimedia.org/T163196) (owner: Faidon Liambotis)
[07:57:11] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:57:11] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[07:57:11] RECOVERY - MariaDB Slave Lag: m3 on dbstore1001 is OK: OK slave_sql_lag Slave_IO_Running: Yes, Slave_SQL_Running: No, (no error: intentional)
[07:57:21] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:57:22] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[07:57:22] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[07:57:22] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[07:58:01] RECOVERY - MariaDB Slave Lag: m2 on dbstore1001 is OK: OK slave_sql_lag not a slave
[07:58:01] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:58:01] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave
[07:58:01] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:58:01] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:58:02] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:58:02] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[07:58:03] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[07:58:03] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:58:04] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave
[07:58:21] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[07:58:30] (CR) Urbanecm: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/350363 (https://phabricator.wikimedia.org/T163870) (owner: Dzahn)
[07:59:36] !log shutting down mariadb on db1040 as a backup before decommissioning
[07:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:21] Operations, ops-eqiad, netops: Spread eqiad analytics Kafka nodes to multiple racks ans rows - https://phabricator.wikimedia.org/T163002#3213051 (elukey) >>! In T163002#3211701, @Ottomata wrote: > Hm, we got a problem! These Kafka nodes are in the Analytics VLAN networks, AND have IPv6 configured....
[08:00:46] (CR) Muehlenhoff: "No, let's revert this. Our OpenLDAP configuration is W/W, it doesn't matter which DC the change is targeted at. And even if we wanted that" [puppet] - https://gerrit.wikimedia.org/r/350325 (owner: Dzahn)
[08:04:58] (PS2) Muehlenhoff: Configure terbium for installation with jessie [puppet] - https://gerrit.wikimedia.org/r/349945
[08:10:54] (CR) DCausse: [C: 1] "Doing all of this at the same time will make zh wikis use the new builders while we reindex." [mediawiki-config] - https://gerrit.wikimedia.org/r/350312 (https://phabricator.wikimedia.org/T163829) (owner: Tjones)
[08:16:51] (CR) Muehlenhoff: [C: 2] Configure terbium for installation with jessie [puppet] - https://gerrit.wikimedia.org/r/349945 (owner: Muehlenhoff)
[08:17:16] (CR) Filippo Giunchedi: [C: 2] base: fix run-puppet-agent --enable help [puppet] - https://gerrit.wikimedia.org/r/350124 (owner: Filippo Giunchedi)
[08:17:23] (PS4) Filippo Giunchedi: base: fix run-puppet-agent --enable help [puppet] - https://gerrit.wikimedia.org/r/350124
[08:22:51] !log reimaging terbium to jessie
[08:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:09] (CR) Marostegui: [C: 2] db-eqiad.php: Depool hosts that need to be moved [mediawiki-config] - https://gerrit.wikimedia.org/r/350372 (https://phabricator.wikimedia.org/T162681) (owner: Marostegui)
[08:23:55] (CR) Volans: "Puppet compiler results (without the removal of comment in nrpe_local.cfg.erb to get as much as possible NOOPs) here:" [puppet] - https://gerrit.wikimedia.org/r/345569 (https://phabricator.wikimedia.org/T163196) (owner: Faidon Liambotis)
[08:24:37] (Merged) jenkins-bot: db-eqiad.php: Depool hosts that need to be moved [mediawiki-config] - https://gerrit.wikimedia.org/r/350372 (https://phabricator.wikimedia.org/T162681) (owner: Marostegui)
[08:25:38] (CR) jenkins-bot: db-eqiad.php: Depool hosts that need to be moved [mediawiki-config] - https://gerrit.wikimedia.org/r/350372 (https://phabricator.wikimedia.org/T162681) (owner: Marostegui)
[08:27:41] (CR) Giuseppe Lavagetto: [C: 1] "Small change suggested but not a blocker, LGTM" (3 comments) [switchdc] - https://gerrit.wikimedia.org/r/349737 (https://phabricator.wikimedia.org/T163372) (owner: Volans)
[08:27:45] !log marostegui@naos Synchronized wmf-config/db-eqiad.php: Depool hosts that need to be moved for the network maintenance - T162681 (duration: 02m 25s)
[08:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:53] T162681: Network maintenance on row D (databases) - https://phabricator.wikimedia.org/T162681
[08:28:03] (PS2) Volans: Mediawiki: refactor stop/start maintenance [switchdc] - https://gerrit.wikimedia.org/r/349737 (https://phabricator.wikimedia.org/T163372)
[08:28:14] (CR) Volans: "Replies inline" (2 comments) [switchdc] - https://gerrit.wikimedia.org/r/349731 (https://phabricator.wikimedia.org/T163367) (owner: Volans)
[08:28:59] Operations, Multimedia, media-storage, User-fgiunchedi: 404 error while accessing some images files (e.g. djvu, jpg, png, webm) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3213074 (fgiunchedi) @Revent thanks for your report! It looks like those file were moved by steins...
[08:29:03] damn gerrit/git review
[08:29:21] (CR) Volans: "ignore the last patchset, git review facepalm" [switchdc] - https://gerrit.wikimedia.org/r/349737 (https://phabricator.wikimedia.org/T163372) (owner: Volans)
[08:29:59] !log Deploy alter table on change_tag and tag_summary on silver and labtestweb2001 - T147166
[08:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:07] T147166: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166
[08:30:20] (PS3) Volans: Mediawiki: refactor stop/start maintenance [switchdc] - https://gerrit.wikimedia.org/r/349737 (https://phabricator.wikimedia.org/T163372)
[08:31:18] (PS3) Volans: IRC logging, make messages more human-friendly [switchdc] - https://gerrit.wikimedia.org/r/349731 (https://phabricator.wikimedia.org/T163367)
[08:31:56] (CR) Giuseppe Lavagetto: [C: 1] "LGTM, suggested a small change in the code, but that's optional." (1 comment) [switchdc] - https://gerrit.wikimedia.org/r/349738 (https://phabricator.wikimedia.org/T163364) (owner: Volans)
[08:32:08] !log Gracefully stopping hadoop daemons on Hadoop nodes affected by Row-D maintenance
[08:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:09] (CR) Giuseppe Lavagetto: [C: 1] "Very nice!" [switchdc] - https://gerrit.wikimedia.org/r/349879 (https://phabricator.wikimedia.org/T163373) (owner: Volans)
[08:34:30] (CR) Volans: "> I'm not sure this makes it better to me, (DONE vs 2/2 has really no" [switchdc] - https://gerrit.wikimedia.org/r/349732 (https://phabricator.wikimedia.org/T163371) (owner: Volans)
[08:35:19] (CR) Giuseppe Lavagetto: [C: 1] DNS: add removal of confd stale files (1 comment) [switchdc] - https://gerrit.wikimedia.org/r/349880 (https://phabricator.wikimedia.org/T163376) (owner: Volans)
[08:37:33] (CR) Giuseppe Lavagetto: [C: -1] IRC logging, make messages more human-friendly (1 comment) [switchdc] - https://gerrit.wikimedia.org/r/349731 (https://phabricator.wikimedia.org/T163367) (owner: Volans)
[08:38:01] (CR) Giuseppe Lavagetto: [C: 1] IRC logging, make messages more human-friendly [switchdc] - https://gerrit.wikimedia.org/r/349731 (https://phabricator.wikimedia.org/T163367) (owner: Volans)
[08:38:55] _joe_: undecided? :-P
[08:39:03] the average is 0 :D
[08:39:13] <_joe_> volans: nope, that's just gerrit
[08:39:26] <_joe_> the first one is a comment on the old PS
[08:39:34] <_joe_> which still had an error, so -1 :P
[08:39:43] <_joe_> the latter is on the latest amended revision
[08:39:51] lol
[08:39:56] anyway, feel free to comment on the task
[08:40:10] <_joe_> It's ok, just merge it
[08:40:13] I was waiting para.void feedback too there, given that was a request coming from him
[08:40:24] sure I'll merge, we can improve it later
[08:40:52] <_joe_> I think the change to the message explicitly stating we're switching from dc_from to dc_to is a big improvement in readability
[08:43:51] (CR) Volans: [C: 2] IRC logging, make messages more human-friendly [switchdc] - https://gerrit.wikimedia.org/r/349731 (https://phabricator.wikimedia.org/T163367) (owner: Volans)
[08:43:57] agree
[08:45:51] (CR) Volans: [C: 2] Make more evident when a submenu was completed [switchdc] - https://gerrit.wikimedia.org/r/349732 (https://phabricator.wikimedia.org/T163371) (owner: Volans)
[08:46:02] (PS3) Volans: Make more evident when a submenu was completed [switchdc] - https://gerrit.wikimedia.org/r/349732 (https://phabricator.wikimedia.org/T163371)
[08:46:23] did something on terbium changed at 8:20 ?
[08:46:58] I think we have been left without db maintenance cronscripts
[08:47:48] jynus: 08:22 moritzm: reimaging terbium to jessie
[08:47:54] from SAL
[08:48:10] <_joe_> jynus: weren't those migrated to wasat?
[08:48:11] then those are not running on the codfw site
[08:48:20] <_joe_> uhm, let's see
[08:48:52] <_joe_> this is a big issue
[08:48:59] <_joe_> as the two server should be identical
[08:49:09] <_joe_> where are those defined in puppet?
[08:49:19] dbmaintenance class somewhere
[08:49:35] role::mariadb::maintenance ?
[08:50:01] <_joe_> which is applied... nowhere
[08:50:12] terbium.eqiad.wmnet,wasat.codfw.wmnet according to puppetdb
[08:50:20] maybe it is a permission error
[08:50:37] can you check cron being defined
[08:50:42] while I check the database?
[08:50:50] <_joe_> jynus: yes, doing it now
[08:51:09] no, I think this broke by Daniel's commit from last night
[08:51:10] <_joe_> modules/profile/manifests/mediawiki/maintenance.pp: include ::role::mariadb::maintenance
[08:51:13] <_joe_> right
[08:51:40] wit was role(mariadb::maintenance, mediawiki::maintenance, openldap::management) (I have an old local copy)
[08:51:46] s/wit/it/
[08:51:51] prior to 97510c3e site.pp also included mariadb::maintenance
[08:52:11] <_joe_> moritzm: yeah that is ok
[08:52:15] ok
[08:52:36] <_joe_> sudo -u tendril crontab -l
[08:52:36] <_joe_> no crontab for tendril
[08:52:59] <_joe_> on wasat
[08:53:08] <_joe_> that is definitely a problem with puppettization
[08:53:56] <_joe_> found, fixing it
[08:55:01] (PS1) Giuseppe Lavagetto: tendril::maintenance: enable temporarily in codfw [puppet] - https://gerrit.wikimedia.org/r/350375
[08:55:30] <_joe_> moritzm: how long before terbium is back online though?
[08:56:11] base insstall is completed, initial puppet run is running
[08:56:28] ok, we can wait
[08:56:34] <_joe_> jynus: I'd wait for that to be back up then, and make a better solution
[08:56:38] <_joe_> cool :)
[08:56:42] although, on the other side
[08:56:47] I would test it on codfw
[08:56:52] where it has never been run
[08:57:01] <_joe_> jynus: yes, wait for a better puppettization of that :)
[08:57:05] ok
[08:57:36] to be fair, that may be related to the tendril setup on codfw
[08:57:45] which I would wait one month to have it
[08:59:31] how to handle that, though if we do not have mw_primary and we should not call etcd?
[08:59:51] make a dns check and puppetize based on that?
[09:00:37] oh, queries are back [09:01:32] probably it shouldn't be on terbium in the first place [09:01:37] but on dbmonitor [09:01:47] (03PS1) 10Phuedx: Reading Web Page Previews alerts [puppet] - 10https://gerrit.wikimedia.org/r/350377 [09:02:21] ^ first time, be kind ^.^ [09:07:42] PROBLEM - High load average on labstore1003 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [24.0] [09:08:42] RECOVERY - High load average on labstore1003 is OK: OK: Less than 50.00% above the threshold [16.0] [09:08:48] (03PS3) 10Jcrespo: Set db1063 as the last server on s5 [software] - 10https://gerrit.wikimedia.org/r/350227 (https://phabricator.wikimedia.org/T162133) [09:08:49] (03PS1) 10Jcrespo: Add wikitech.hosts for rolling schema changes there [software] - 10https://gerrit.wikimedia.org/r/350379 [09:09:06] (03PS2) 10Jcrespo: Add wikitech.hosts for rolling schema changes there [software] - 10https://gerrit.wikimedia.org/r/350379 [09:09:12] (03PS1) 10Giuseppe Lavagetto: role::mediawiki_maintenance: reorganize code [puppet] - 10https://gerrit.wikimedia.org/r/350380 [09:11:01] Joe, I am more than cool with those changes [09:11:20] but I do not think that is related to the issue [09:11:20] <_joe_> jynus: https://puppet-compiler.wmflabs.org/6236/ :) [09:11:48] the issue is that is is easy to forger changing enable/disable [09:11:54] *forget [09:12:04] <_joe_> jynus: no, of course, but my next question is [09:12:19] <_joe_> is tendril's maintenance tied to mediawiki's in any way? [09:12:37] <_joe_> if it's not, it deserves to be treated separately like we're doing [09:12:39] I have no idea what that does [09:12:49] <_joe_> the tendril cron you mean? 
[09:12:53] so I will investigate an see [09:13:04] it probably isn't [09:13:16] and that is why I said to move it to its own dbmonitor [09:13:34] <_joe_> ok so if that's the case, the current puppetization is ok [09:13:35] but maybe it requires access to a mediawiki config [09:13:44] <_joe_> modulo the small changes i just did [09:13:45] jynus: terbium is back up with jessie [09:14:11] (03CR) 10Giuseppe Lavagetto: [C: 032] "Noop as per the compiler https://puppet-compiler.wmflabs.org/6236/" [puppet] - 10https://gerrit.wikimedia.org/r/350380 (owner: 10Giuseppe Lavagetto) [09:14:18] <_joe_> moritzm: thanks [09:14:41] _joe_, it probably is IDEMPOTENT [09:15:05] so we could run it at the same time from both with half the rate [09:15:54] <_joe_> jynus: that would be cool :) [09:16:03] <_joe_> just lemme know [09:16:16] to be fair, not this week [09:16:25] <_joe_> whenever you want, we can switch to using codfw now just to be sure it works from there [09:16:26] I said on meeting if it is broken [09:16:40] if not, we leave it for another time [09:16:49] !log Shutdown es1019 for maintenance - T162681 [09:16:50] not worth [09:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:56] if is fixed again [09:16:57] T162681: Network maintenance on row D (databases) - https://phabricator.wikimedia.org/T162681 [09:17:08] leave the patch there, I will look at it next week [09:17:28] _joe_, to be fair, the fact that it is pupetized [09:17:33] which I think it was mostly you [09:17:39] it already a huge win [09:17:45] (03PS1) 10Alexandros Kosiaris: Add analytics1-b-eqiad IPv6 stanza [dns] - 10https://gerrit.wikimedia.org/r/350381 (https://phabricator.wikimedia.org/T163002) [09:17:48] compared to how we found it [09:17:49] (03CR) 10Marostegui: [C: 031] Add wikitech.hosts for rolling schema changes there [software] - 10https://gerrit.wikimedia.org/r/350379 (owner: 10Jcrespo) [09:17:55] so, thank you, _joe_ [09:19:57] I think the only reason it was on 
terbium was because it had access to the database [09:20:26] so I will move it to be auto-contained [09:20:39] and if possible active-active [09:21:12] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4010_v4, cp4010_v6 [09:21:22] PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4010_v4, cp4010_v6 [09:21:22] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4010_v4, cp4010_v6 [09:21:26] (03PS1) 10Giuseppe Lavagetto: profile::mariadb::maintenance: switch to codfw [puppet] - 10https://gerrit.wikimedia.org/r/350382 [09:21:32] PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4010_v4, cp4010_v6 [09:21:32] PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4010_v4, cp4010_v6 [09:21:32] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4010_v4, cp4010_v6 [09:21:32] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4010_v4, cp4010_v6 [09:21:32] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4010_v4, cp4010_v6 [09:21:39] <_joe_> ema: is this you? 
[09:21:55] <_joe_> this is the second set of ipsec failures I see this morning [09:22:08] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 56 ESP OK [09:22:18] <_joe_> jynus: in case you want to test tendril maintenance from codfw, https://gerrit.wikimedia.org/r/350382 [09:22:18] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 56 ESP OK [09:22:28] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 56 ESP OK [09:22:28] RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 56 ESP OK [09:22:28] RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 56 ESP OK [09:22:38] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 56 ESP OK [09:22:38] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 56 ESP OK [09:23:18] <_joe_> elukey: hadoop-yarn-nodemanager failing on most nodes [09:23:28] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 56 ESP OK [09:23:42] <_joe_> oh was acked, sorry [09:24:11] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: Spread eqiad analytics Kafka nodes to multiple racks ans rows - https://phabricator.wikimedia.org/T163002#3183025 (10akosiaris) analytics1-b-eqiad IPv6 does exist and it's `2620:0:861:105/64`. I 've update the DNS templates in the patch above to refl... 
[09:24:20] _joe_: yep, it's me :) [09:24:31] !log Shutdown db1094, db1093, db1091 for maintenance - T162681 [09:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:39] T162681: Network maintenance on row D (databases) - https://phabricator.wikimedia.org/T162681 [09:25:12] akosiaris++ [09:25:14] (03PS4) 10Volans: Mediawiki: refactor stop/start maintenance [switchdc] - 10https://gerrit.wikimedia.org/r/349737 (https://phabricator.wikimedia.org/T163372) [09:25:48] _joe_ those should be downtimed, I am prepping for row-d maintenance (also logged in sal) [09:26:15] <_joe_> elukey: yeah sorry [09:26:58] _joe_ no no thanks for letting me know, it is good to double check since I am shutting down half of the hadoop cluster :D [09:27:06] (03CR) 10Volans: "Changes done." (033 comments) [switchdc] - 10https://gerrit.wikimedia.org/r/349737 (https://phabricator.wikimedia.org/T163372) (owner: 10Volans) [09:27:12] Let me know if I can help with anything for row-d [09:27:48] XioNoX: o/ - Alex solved my doubt about the kafka nodes, I think that we are good [09:28:12] saw the update, that's cool! [09:28:45] XioNoX: hey, re: today's maintenance, ping me when the time comes to shut ms-be, for those is enough a graceful "poweroff" I can take care of it [09:29:54] godog: according to the schedule, rack moves is going to be starting at 14:30utc [09:30:28] (03CR) 10Volans: "Changes done" (031 comment) [switchdc] - 10https://gerrit.wikimedia.org/r/349738 (https://phabricator.wikimedia.org/T163364) (owner: 10Volans) [09:31:16] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3213176 (10Joe) For the record, @akosiaris and me switched etcd client traffic to codfw to allow relocating conf1003 with ample time. 
[09:32:11] (03CR) 10Alexandros Kosiaris: [C: 032] Add analytics1-b-eqiad IPv6 stanza [dns] - 10https://gerrit.wikimedia.org/r/350381 (https://phabricator.wikimedia.org/T163002) (owner: 10Alexandros Kosiaris) [09:32:14] XioNoX: ok thanks! is there a list of all hosts affected in phab ? I can't find it apparently :( [09:34:09] (03CR) 10Volans: "I'll wait to merge the previous ones before uploading the changes, gerrit wants me to push also the previous ones that are changed now." [switchdc] - 10https://gerrit.wikimedia.org/r/349737 (https://phabricator.wikimedia.org/T163372) (owner: 10Volans) [09:34:13] (03CR) 10Jcrespo: [C: 032] Add wikitech.hosts for rolling schema changes there [software] - 10https://gerrit.wikimedia.org/r/350379 (owner: 10Jcrespo) [09:34:32] (03PS2) 10Jcrespo: mariadb: promote db1063 as s5 master [puppet] - 10https://gerrit.wikimedia.org/r/350228 (https://phabricator.wikimedia.org/T162133) [09:34:38] (03CR) 10Jcrespo: [C: 032] mariadb: promote db1063 as s5 master [puppet] - 10https://gerrit.wikimedia.org/r/350228 (https://phabricator.wikimedia.org/T162133) (owner: 10Jcrespo) [09:39:47] godog: I think in the email I sent to ops@ there are links to racktables [09:42:15] !log restarting mariadb at db1063 [09:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:22] XioNoX: indeed! thanks [09:48:22] !log migrating s5 eqiad replicas under db1063 [09:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:36] ^marostegui [09:48:45] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3213213 (10elukey) >>! In T148506#3213176, @Joe wrote: > For the record, @akosiaris and me switched etcd client traffic to codfw to allow relocat... 
[09:48:49] \ø/ [09:49:03] we will see if they break again [09:50:08] something is writing into db1063 [09:50:12] and I do not know what it is [09:50:18] uuh?? [09:50:19] lets see [09:50:58] db1049 is stopped [09:51:14] heartbeat? [09:51:19] ah, I know [09:51:24] the other heartbeat instance [09:51:51] I had stopped puppet on db1063 [09:51:54] but not on 49 [09:52:05] :) [09:52:05] that should do it [09:52:34] WE ARE GOOD [09:52:39] sorry for the caps [09:53:32] yay!! [09:53:58] no error this time [09:54:20] maybe it was me all this time? [09:54:33] i don't think so [09:54:44] I think it was gtid-binlog issue [09:55:02] as you said, somethning maybe was written directly on that slave or who knows [09:55:21] I think maybe [09:55:40] gtid doesn't know what to do if it is ahead of the new master? [09:55:56] but that should never happen [09:56:05] but even after the slave gets delayed, it gets broken forver [09:56:06] unless you have written directly to it [09:56:07] it could [09:56:17] I didn't stop the replication before [09:56:23] only now to be 100% sure [09:56:50] on the new master, I mean [09:57:05] yeah [09:57:10] don't know [09:57:18] maybe all this slave_pos vs current_pos is playing a role here [09:57:25] but it is confusing [09:57:43] wait, is it really working? [09:58:08] so far i see hosts hanging from db1063 [09:58:12] yes [09:58:15] but it says [09:58:18] PROBLEM - MariaDB Slave Lag: s5 on db1082 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 613.55 seconds [09:58:23] Exec_Master_Log_Pos: 4 [09:58:28] ^that is ok, expected [09:58:55] Exec_Master_Log_Pos: 13981010 [09:58:58] that is one of the slaves [09:59:00] and master is on 13981010 [09:59:17] db1070 [09:59:23] let me check db1070 [09:59:30] db1045 and db1026 look good [09:59:43] db1070 will break? 
[09:59:50] Exec_Master_Log_Pos: 13981010 [09:59:56] that is db1070 [10:00:03] I swear it sead another value [10:00:46] not crazy, have it printed on screen [10:00:57] maybe it takes minutes to say it correctly? [10:01:03] i am checking the log [10:01:06] to see if it says something [10:01:18] 70426 9:57:01 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='db1049.eqiad.wmnet', master_port='3306', master_log_file='db1049-bin.004713', master_log_pos='788124249'. New state master_host='db1063.eqiad.wmnet', master_port='3306', master_log_file='', master_log_pos='4'. [10:01:24] master_log_pos=4 indeed [10:01:36] but then [10:01:43] 170426 9:57:04 [Note] Slave I/O thread: Start semi-sync replication to master 'repl@db1063.eqiad.wmnet:3306' in log 'FIRST' at position 4 [10:01:46] 170426 9:57:04 [Note] Slave I/O thread: connected to master 'repl@db1063.eqiad.wmnet:3306',replication starts at GTID position '180359179-180359179-49314181,0-171970704-5733475846,171970704-171970704-351094624' [10:02:19] so should I continue? [10:02:25] I would continue [10:02:37] And if db1070 breaks, we can reclone it [10:02:45] so far db1026 and db1045 never had issues [10:03:01] yeah, I am not worried about that [10:03:11] but if *all* breaks :-) [10:03:40] I am going to check other slaves from yesterday switchovers [10:03:45] to see if they had the same thing on the logs [10:03:46] look at 71 [10:03:53] ok [10:04:06] same [10:04:07] yes [10:04:14] I think it takes minutes to reposition itself? [10:04:15] let me see s6 for example [10:06:05] 3 slaves from s6 didn't have that issue [10:06:09] but one slave from s7 did [10:06:21] 170425 14:35:47 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='db1041.eqiad.wmnet', master_port='3306', master_log_fil [10:06:24] e='db1041-bin.003051', master_log_pos='919243356'. New state master_host='db1062.eqiad.wmnet', master_port='3306', master_log_file='', [10:06:27] master_log_pos='4'. 
[10:06:29] 170425 14:35:47 [Note] Previous Using_Gtid=Slave_Pos. New Using_Gtid=Slave_Pos [10:06:32] 170425 14:35:50 [Note] Slave I/O thread: Start semi-sync replication to master 'repl@db1062.eqiad.wmnet:3306' in log 'FIRST' at position 4 [10:06:45] db1079 from s7 also had that [10:07:41] db1090 from s2 too [10:07:45] so I think it is "normal"? [10:08:48] I do not like the slave giving fake positions- it should say something else [10:09:19] well, it is not really a fake, it is the first one until it finds the correct one scanning all the binlog or the gtid table or whatever it scans [10:09:27] i am trying to understand the logic [10:09:28] XD [10:09:44] yeah, because that is going to work marverlously with a script [10:10:10] 06Operations, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3213244 (10elukey) Added some TCP graphs to https://grafana.wikimedia.org/dashboard/db/prometheus-apache-hhv... [10:11:25] so I think I understood the error [10:11:35] if out of band writes are done [10:11:45] it errors out [10:11:58] out of band as in directly writing to that host? [10:12:06] for example, the few pt-heartbeat executions [10:12:15] that ran on db1063 [10:12:24] coming from the upstream master ,yes [10:12:28] no [10:12:31] from itself [10:12:33] jynus: marostegui any objection to me deploying https://gerrit.wikimedia.org/r/#/c/350384 now? (Just making sure I'm not going to get in the way of anything you were going to push out db config wise) [10:12:58] you have 3 minutes, addshore [10:13:03] ack! [10:14:52] jynus but that is where the slave_pos and current_pos might be playing a role [10:15:06] it is so confusing.. 
[10:15:55] I think that model doesn't work [10:15:57] for us [10:15:58] (03CR) 10Giuseppe Lavagetto: [C: 031] Mediawiki: refactor stop/start maintenance [switchdc] - 10https://gerrit.wikimedia.org/r/349737 (https://phabricator.wikimedia.org/T163372) (owner: 10Volans) [10:16:08] as we do some out-of-band changes + multisource [10:16:42] yeah, not to mention that gtid+multisource is completely broken [10:16:42] (03CR) 10Volans: [C: 032] Mediawiki: refactor stop/start maintenance [switchdc] - 10https://gerrit.wikimedia.org/r/349737 (https://phabricator.wikimedia.org/T163372) (owner: 10Volans) [10:16:45] as we have proved [10:16:50] (03PS1) 10Filippo Giunchedi: swift: create required LV in labs [puppet] - 10https://gerrit.wikimedia.org/r/350389 (https://phabricator.wikimedia.org/T162247) [10:16:55] not broken [10:17:16] just not ok for not simplistic scenarios [10:17:25] jynus: all done [10:17:37] not really, the multisource+gtid is broken, because it doesn't work for a simple scenario of 2 threads [10:19:18] RECOVERY - MariaDB Slave Lag: s5 on db1082 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [10:21:04] (03PS2) 10Filippo Giunchedi: profile: introduce swift::storage::labs [puppet] - 10https://gerrit.wikimedia.org/r/350389 (https://phabricator.wikimedia.org/T162247) [10:22:30] (03CR) 10jerkins-bot: [V: 04-1] profile: introduce swift::storage::labs [puppet] - 10https://gerrit.wikimedia.org/r/350389 (https://phabricator.wikimedia.org/T162247) (owner: 10Filippo Giunchedi) [10:23:47] (03PS3) 10Filippo Giunchedi: profile: introduce swift::storage::labs [puppet] - 10https://gerrit.wikimedia.org/r/350389 (https://phabricator.wikimedia.org/T162247) [10:23:55] I think no database started screaming [10:24:30] (03CR) 10Jcrespo: [C: 032] mariadb: Promote db1063 as the master of s5 eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350230 (https://phabricator.wikimedia.org/T162133) (owner: 10Jcrespo) [10:24:33] (03PS2) 10Jcrespo: mariadb: Promote db1063 as the 
master of s5 eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350230 (https://phabricator.wikimedia.org/T162133) [10:31:12] (03CR) 10Jcrespo: [C: 032] Set db1063 as the last server on s5 [software] - 10https://gerrit.wikimedia.org/r/350227 (https://phabricator.wikimedia.org/T162133) (owner: 10Jcrespo) [10:31:16] (03PS2) 10Volans: DNS Discovery: add a check for the resolved address [switchdc] - 10https://gerrit.wikimedia.org/r/349738 (https://phabricator.wikimedia.org/T163364) [10:31:47] (03CR) 10jenkins-bot: mariadb: Promote db1063 as the master of s5 eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350230 (https://phabricator.wikimedia.org/T162133) (owner: 10Jcrespo) [10:32:52] (03CR) 10Ladsgroup: [C: 032] "Per discussion in the phab card. It's a beta cluster patch only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349984 (https://phabricator.wikimedia.org/T142104) (owner: 10Ladsgroup) [10:35:18] (03Merged) 10jenkins-bot: Enable echo notification for wikibase clients in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349984 (https://phabricator.wikimedia.org/T142104) (owner: 10Ladsgroup) [10:35:44] (03CR) 10Giuseppe Lavagetto: [C: 031] DNS Discovery: add a check for the resolved address [switchdc] - 10https://gerrit.wikimedia.org/r/349738 (https://phabricator.wikimedia.org/T163364) (owner: 10Volans) [10:36:15] (03CR) 10jenkins-bot: Enable echo notification for wikibase clients in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349984 (https://phabricator.wikimedia.org/T142104) (owner: 10Ladsgroup) [10:39:26] !log jynus@naos Synchronized wmf-config/db-eqiad.php: switch s5 eqiad master from db1049 to db1063 (duration: 01m 24s) [10:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:43] s5 core replication looks good [10:40:55] \o/ [10:42:59] (03PS1) 10Dereckson: Planet: Remove not maintained blog from fr feed [puppet] - 10https://gerrit.wikimedia.org/r/350391 
[10:44:42] !log Deploy alter table on s5, on db1063 (eqiad master) for tables: change_tag and tag_summary - https://phabricator.wikimedia.org/T147166 [10:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:08] 06Operations, 10DBA: Decomissions old s2 eqiad hosts (db1018, db1021, db1024, db1036) - https://phabricator.wikimedia.org/T162699#3213336 (10jcrespo) [10:45:54] !log ladsgroup@naos Synchronized static/images/wikibase/echoIcon.svg: T142104, part I (duration: 01m 04s) [10:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:02] T142104: Decide on a nice logo for Echo notifications - https://phabricator.wikimedia.org/T142104 [10:46:30] (03CR) 10Volans: [C: 032] DNS Discovery: add a check for the resolved address [switchdc] - 10https://gerrit.wikimedia.org/r/349738 (https://phabricator.wikimedia.org/T163364) (owner: 10Volans) [10:47:24] !log ladsgroup@naos Synchronized wmf-config/Wikibase-labs.php: T142104, part II (duration: 00m 56s) [10:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:28] (03PS2) 10Volans: Traffic: add automatic verification of the changes [switchdc] - 10https://gerrit.wikimedia.org/r/349879 (https://phabricator.wikimedia.org/T163373) [10:49:38] (03PS2) 10Volans: DNS: add removal of confd stale files [switchdc] - 10https://gerrit.wikimedia.org/r/349880 (https://phabricator.wikimedia.org/T163376) [10:54:08] (03PS1) 10Alexandros Kosiaris: ores: Move the application (web+worker) to profiles [puppet] - 10https://gerrit.wikimedia.org/r/350393 [10:59:18] PROBLEM - MariaDB Slave Lag: s5 on db1082 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 651.49 seconds [11:00:13] mm [11:00:31] or is is just the schema change? 
[11:00:41] could be [11:00:45] let me see [11:01:09] there is no lag [11:01:12] there is lag, but not breakage [11:01:13] so maybe it was it [11:01:19] no, lag on the master [11:01:23] ah yes [11:01:24] not seen with slave status [11:01:28] the master yes [11:01:31] the schema change [11:01:33] I was checking db1082 [11:01:35] good [11:01:40] I will ack the lag [11:01:43] on icinga [11:01:43] i thought that host was downtimed [11:02:41] that should be it [11:03:01] althoug it will print here when recovered [11:03:17] (03PS2) 10Alexandros Kosiaris: ores: Move the application (web+worker) to profiles [puppet] - 10https://gerrit.wikimedia.org/r/350393 [11:08:17] 06Operations, 10Mail, 10Wikimedia-Mailing-lists, 05Security: Sender email spoofing - https://phabricator.wikimedia.org/T160529#3213374 (10KTC) p:05Normal>03High And x2 again today. This is getting ridiculous. [11:09:33] (03CR) 10Alexandros Kosiaris: [C: 032] "https://puppet-compiler.wmflabs.org/6238/ says NOOP, merging" [puppet] - 10https://gerrit.wikimedia.org/r/350393 (owner: 10Alexandros Kosiaris) [11:09:38] (03PS3) 10Alexandros Kosiaris: ores: Move the application (web+worker) to profiles [puppet] - 10https://gerrit.wikimedia.org/r/350393 [11:09:43] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] ores: Move the application (web+worker) to profiles [puppet] - 10https://gerrit.wikimedia.org/r/350393 (owner: 10Alexandros Kosiaris) [11:22:35] (03Abandoned) 10Gehel: relforge - add LVS entry [puppet] - 10https://gerrit.wikimedia.org/r/346148 (https://phabricator.wikimedia.org/T162037) (owner: 10Gehel) [11:22:39] (03Abandoned) 10Gehel: elasticsearch - create LVS service for relforge [dns] - 10https://gerrit.wikimedia.org/r/346146 (https://phabricator.wikimedia.org/T162037) (owner: 10Gehel) [11:25:06] (03PS1) 10Ladsgroup: Enable sendEchoNotification for test Wikibase clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350396 (https://phabricator.wikimedia.org/T142102) [11:28:20] (03PS1) 10Aude: Don't 
enable tabular-data data type yet on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350397 [11:31:03] !log rebooting mwlog2001 for update to Linux 4.9 [11:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:56] (03PS1) 10Alexandros Kosiaris: ores: Switch the redis database role to profile [puppet] - 10https://gerrit.wikimedia.org/r/350398 [11:32:06] !log applying new events_coredb_slave.sql on db2055 T160984 [11:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:13] T160984: Reduce max execution time of interactive queries or a better detection and killing of bad query patterns - https://phabricator.wikimedia.org/T160984 [11:34:09] (03PS2) 10Aude: Don't enable tabular-data data type yet on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350397 [11:34:50] (03PS1) 10Alexandros Kosiaris: Rename ores::redis::password [labs/private] - 10https://gerrit.wikimedia.org/r/350399 [11:35:30] no kills or abnormal behaviour- I will monitor ops.event_log in the following hours [11:37:23] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Rename ores::redis::password [labs/private] - 10https://gerrit.wikimedia.org/r/350399 (owner: 10Alexandros Kosiaris) [11:52:11] (03PS2) 10Alexandros Kosiaris: ores: Switch the redis database role to profile [puppet] - 10https://gerrit.wikimedia.org/r/350398 [12:01:05] (03CR) 10Alexandros Kosiaris: [C: 032] "https://puppet-compiler.wmflabs.org/6240/ says NOOP, merging" [puppet] - 10https://gerrit.wikimedia.org/r/350398 (owner: 10Alexandros Kosiaris) [12:01:11] (03PS3) 10Alexandros Kosiaris: ores: Switch the redis database role to profile [puppet] - 10https://gerrit.wikimedia.org/r/350398 [12:01:21] (03PS1) 10Giuseppe Lavagetto: profile::redis::multidc_instance: separate concerns with confd [puppet] - 10https://gerrit.wikimedia.org/r/350404 [12:01:37] <_joe_> akosiaris: ^^ take a look and tell me if you prefer this to what we have now [12:02:05] 
<_joe_> it's not great either, but the templating we can do is extremely limited in confd [12:03:44] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] ores: Switch the redis database role to profile [puppet] - 10https://gerrit.wikimedia.org/r/350398 (owner: 10Alexandros Kosiaris) [12:04:16] _joe_: ok will do [12:11:33] 06Operations, 10netops: Interface errors on cr2-eqiad:xe-4/3/1 - https://phabricator.wikimedia.org/T163542#3213510 (10ayounsi) 1. No errors on telia's side, see: ``` show interfaces TenGigE0/12/0/25 detail Mon Apr 24 11:01:19.740 CET TenGigE0/12/0/25 is up, line protocol is up Interface state transitions: 1... [12:12:32] 06Operations, 10ops-eqiad: cp1066 troubleshoot IPMI issue - https://phabricator.wikimedia.org/T163889#3213527 (10ema) [12:14:58] (03PS1) 10Muehlenhoff: Fix logrotate config [puppet] - 10https://gerrit.wikimedia.org/r/350407 [12:24:17] (03CR) 10Gehel: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/350407 (owner: 10Muehlenhoff) [12:26:44] (03CR) 10Muehlenhoff: [C: 032] Fix logrotate config [puppet] - 10https://gerrit.wikimedia.org/r/350407 (owner: 10Muehlenhoff) [12:31:08] PROBLEM - IPsec on cp4008 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp1066_v4, cp1066_v6 [12:31:08] PROBLEM - IPsec on cp4018 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp1066_v4, cp1066_v6 [12:31:08] PROBLEM - IPsec on cp4009 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp1066_v4, cp1066_v6 [12:31:08] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp1066_v4, cp1066_v6 [12:31:08] PROBLEM - IPsec on cp4016 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp1066_v4, cp1066_v6 [12:31:09] PROBLEM - IPsec on cp4010 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp1066_v4, cp1066_v6 [12:31:09] PROBLEM - IPsec on cp4017 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp1066_v4, cp1066_v6 [12:31:18] PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp1066_v4, cp1066_v6 
[12:31:18] PROBLEM - IPsec on cp3030 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp1066_v4, cp1066_v6 [12:31:18] PROBLEM - IPsec on cp3031 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp1066_v4, cp1066_v6 [12:31:18] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp1066_v4, cp1066_v6 [12:31:18] PROBLEM - IPsec on cp3033 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp1066_v4, cp1066_v6 [12:31:28] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp1066_v4, cp1066_v6 [12:31:28] PROBLEM - IPsec on cp3040 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp1066_v4, cp1066_v6 [12:31:28] PROBLEM - IPsec on cp3042 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp1066_v4, cp1066_v6 [12:31:28] PROBLEM - IPsec on cp3032 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp1066_v4, cp1066_v6 [12:31:28] PROBLEM - IPsec on cp3041 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp1066_v4, cp1066_v6 [12:31:29] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp1066_v4, cp1066_v6 [12:31:29] PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp1066_v4, cp1066_v6 [12:31:30] PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp1066_v4, cp1066_v6 [12:31:38] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 55 not-conn: cp1066_v6 [12:31:38] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 55 not-conn: cp1066_v6 [12:31:39] that's cp1066 not rebooting properly :( ^ [12:31:50] oh no, it just did come up :) [12:32:08] RECOVERY - IPsec on cp4008 is OK: Strongswan OK - 44 ESP OK [12:32:08] RECOVERY - IPsec on cp4018 is OK: Strongswan OK - 44 ESP OK [12:32:08] RECOVERY - IPsec on cp4009 is OK: Strongswan OK - 44 ESP OK [12:32:08] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 56 ESP OK [12:32:08] RECOVERY - IPsec on cp4016 is OK: Strongswan OK - 44 ESP OK [12:32:09] RECOVERY - IPsec on 
cp4010 is OK: Strongswan OK - 44 ESP OK [12:32:09] RECOVERY - IPsec on cp4017 is OK: Strongswan OK - 44 ESP OK [12:32:18] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 56 ESP OK [12:32:19] RECOVERY - IPsec on cp3030 is OK: Strongswan OK - 44 ESP OK [12:32:19] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 44 ESP OK [12:32:19] RECOVERY - IPsec on cp3031 is OK: Strongswan OK - 44 ESP OK [12:32:19] RECOVERY - IPsec on cp3033 is OK: Strongswan OK - 44 ESP OK [12:32:28] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 56 ESP OK [12:32:28] RECOVERY - IPsec on cp3040 is OK: Strongswan OK - 44 ESP OK [12:32:28] RECOVERY - IPsec on cp3042 is OK: Strongswan OK - 44 ESP OK [12:32:28] RECOVERY - IPsec on cp3041 is OK: Strongswan OK - 44 ESP OK [12:32:28] RECOVERY - IPsec on cp3032 is OK: Strongswan OK - 44 ESP OK [12:32:29] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 56 ESP OK [12:32:29] RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 56 ESP OK [12:32:30] RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 56 ESP OK [12:32:38] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 56 ESP OK [12:32:38] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 56 ESP OK [12:39:55] 06Operations, 10vm-requests: codfw: VM request for poolcounter2001 - https://phabricator.wikimedia.org/T163892#3213611 (10MoritzMuehlenhoff) [12:43:54] !log ema@neodymium conftool action : set/pooled=no; selector: name=cp2014.codfw.wmnet,service=varnish-be [12:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:20] (03PS2) 10Gehel: tests: no more ignore postgresql spec [puppet] - 10https://gerrit.wikimedia.org/r/345849 (owner: 10Hashar) [12:46:35] !log installing mysql security updates (5.5 as packaged in Debian jessie) [12:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:18] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [12:47:34] known ^ 
[12:48:56] (03CR) 10Gehel: [C: 032] tests: no more ignore postgresql spec [puppet] - 10https://gerrit.wikimedia.org/r/345849 (owner: 10Hashar) [12:50:00] (03PS2) 10Gehel: Align elasticsearch jvm options with upstream [puppet] - 10https://gerrit.wikimedia.org/r/345632 (https://phabricator.wikimedia.org/T161830) (owner: 10EBernhardson) [12:50:42] 06Operations, 10Mail, 10Wikimedia-Mailing-lists, 05Security: Sender email spoofing - https://phabricator.wikimedia.org/T160529#3213641 (10Liuxinyu970226) >>! In T160529#3213374, @KTC wrote: > And x2 again today. This is getting ridiculous. Hello Mr @KTC, as [[ https://www.mediawiki.org/wiki/Bug_management... [12:52:28] PROBLEM - Host ocg1001 is DOWN: PING CRITICAL - Packet loss = 100% [12:53:20] wasn't ocg already down? [12:53:44] 1001 [12:54:02] jynus: i believe so, but i think checks after the inital report of the host going down still will report notifications unless acked or downtime'd [12:54:05] looking [12:54:37] (03CR) 10Gehel: [C: 032] Align elasticsearch jvm options with upstream [puppet] - 10https://gerrit.wikimedia.org/r/345632 (https://phabricator.wikimedia.org/T161830) (owner: 10EBernhardson) [12:54:50] the alert is new [12:54:52] not old [12:55:26] !log restart elasticsearch on relforge1001 to validate new config - T161830 [12:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:36] jynus: oh sorry I was thinking another channel my bad... [12:55:36] T161830: Resolve inconsistencies between deployed jvm options and upstream suggested jvm options for elasticsearch - https://phabricator.wikimedia.org/T161830 [12:56:08] there is not history for the host [12:56:18] RECOVERY - MariaDB Slave Lag: s5 on db1082 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [12:56:18] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:56:24] einstenium shwich related? 
[12:56:31] !log Shutdown db1092 for maintenance - https://phabricator.wikimedia.org/T162681 [12:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:50] ocg1001 is still disabled, though (hieradata/hosts/ocg1001.yaml), so this does not have any real impact [12:57:07] monitoring weirdness, then? [12:57:35] yeah, seems so [12:58:28] PROBLEM - ElasticSearch health check for shards on relforge1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 66 threshold =0.1 breach: status: red, number_of_nodes: 1, unassigned_shards: 66, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 69, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 51.11111111 [12:59:06] elasticsearch is me, silencing relforge1001 was not enough... [12:59:48] RECOVERY - Host ocg1001 is UP: PING OK - Packet loss = 0%, RTA = 36.08 ms [12:59:56] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3213672 (10Marostegui) The databases affected by the move are now off and can be moved anytime: ``` es1019 db1094 db1093 db1092 db1091 ``` [13:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170426T1300). [13:00:04] matthiasmullie, mutante, and Amir1: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:10] o/ [13:00:11] !log cp2017: restart varnish-be [13:00:16] here! 
[13:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:00] o/ [13:01:18] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [13:01:20] Hi [13:01:26] should I do the swat? anybody else wants to? [13:01:36] hashar is traveling today [13:01:52] * aude waves [13:01:58] zeljkof: I've noticed a discussion about the mutante change, they aren't available to the swat, but you can merge it, the change is fine, and untestable anyway [13:02:01] aude: ^ [13:02:08] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [13:02:09] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [13:02:09] suppose i could do swat [13:02:24] aude: great, go ahead :) [13:02:24] at least looking at what there is [13:02:27] !log ema@neodymium conftool action : set/pooled=yes; selector: name=cp2014.codfw.wmnet,service=varnish-be [13:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:48] anyone used vagrant recently ? it's failing on git clone for mediawiki with a TLS problem for me. [13:02:56] the ip throttle should be ok [13:03:37] I have 1 to add to swat if possible :) [13:03:51] addshore: ok [13:04:02] thedj: what's the OpenSSL version shipped? [13:04:12] RPC failed; curl 56 GnuTLS recv error (-54): Error in the pull function. [13:04:24] (03CR) 10Aude: [C: 032] lift IP throttle for event at high school in Jesi,Italy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350363 (https://phabricator.wikimedia.org/T163870) (owner: 10Dzahn) [13:04:30] ah compiled against GnuTLS and not OpenSSL [13:04:37] for some reason, someone thinks that gnutls is still qualified to include in debian [13:05:02] what would you want, libressl ? [13:05:23] matthiasmullie: around? 
[13:05:46] aude: do you mine if I do mine alongside yours? https://gerrit.wikimedia.org/r/#/c/350409/ Mine is on a core branch & I'll actually only scap pull it to mwdebug1001/2 test it and then revert it (if not I'll just wait until the end) [13:05:59] (03Merged) 10jenkins-bot: lift IP throttle for event at high school in Jesi,Italy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350363 (https://phabricator.wikimedia.org/T163870) (owner: 10Dzahn) [13:06:07] thedj: it seems to be a known issue for a lot of person when I see the results for GnuTLS "Error in the pull function" [13:06:07] all of yours are in config :) [13:06:20] https://github.com/google/ios-webkit-debug-proxy/issues/166 for example [13:06:27] addshore: ok [13:06:30] yeah. and the advice is mostly "recompile with ssh", but no indication of cause or anything. [13:07:23] thedj: so wild guess is GnuTLS and Gerrit can agree about what cyphers to use [13:07:27] addshore: i'm not sure mwdebug works though [13:07:35] can't [13:07:39] aude: it does :) [13:07:41] we are on naos [13:07:41] (03CR) 10jenkins-bot: lift IP throttle for event at high school in Jesi,Italy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350363 (https://phabricator.wikimedia.org/T163870) (owner: 10Dzahn) [13:07:52] codfw [13:08:00] use mw2017 [13:08:07] if you wish a codfw debug server [13:08:28] RECOVERY - ElasticSearch health check for shards on relforge1002 is OK: OK - elasticsearch status relforge-eqiad: status: red, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 4, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 129, task_max_waiting_in_queue_millis: 343, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 97.8102189781, active_shards: 134, init [13:09:16] Dereckson: https://phabricator.wikimedia.org/T152801 [13:10:05] !log aude@naos Synchronized wmf-config/throttle.php: (no justification provided) (duration: 01m 23s) [13:10:12] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:33] aude: my patch is all done and out of the way :) (Will mark it as so on the calander) [13:10:39] ok [13:11:06] (03CR) 10Aude: [C: 032] Enable sendEchoNotification for test Wikibase clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350396 (https://phabricator.wikimedia.org/T142102) (owner: 10Ladsgroup) [13:11:18] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:12:08] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:12:54] 06Operations, 10ops-eqiad, 10DBA: Move masters away from D1 in eqiad? - https://phabricator.wikimedia.org/T163895#3213695 (10Marostegui) [13:13:08] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:14:07] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#3213711 (10Gilles) [13:14:23] waiting for jenkins [13:15:11] (03Merged) 10jenkins-bot: Enable sendEchoNotification for test Wikibase clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350396 (https://phabricator.wikimedia.org/T142102) (owner: 10Ladsgroup) [13:15:24] (03CR) 10Aude: [C: 032] Don't enable tabular-data data type yet on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350397 (owner: 10Aude) [13:15:44] (03CR) 10jenkins-bot: Enable sendEchoNotification for test Wikibase clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350396 (https://phabricator.wikimedia.org/T142102) (owner: 10Ladsgroup) [13:16:48] (03PS1) 10Marostegui: db-eqiad.php: Repool db1026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350411 (https://phabricator.wikimedia.org/T162539) [13:17:16] urandom, gehel, elukey, ~1h ish before maintenance ping: 
https://phabricator.wikimedia.org/T148506#3205842 [13:17:35] thankssss [13:17:45] thanks! [13:19:04] matthiasmullie: around? [13:19:09] yes [13:19:16] There are 18 shards left on elasticsearch row D, cluster will go yellow when we lose the row, but that's not an issue [13:19:24] i'll deploy your change in a few minutes [13:19:29] suppose it's labs only though [13:19:31] so should be fine [13:19:52] (03CR) 10Aude: [C: 032] Turn $wg3dProcessor into an array of arguments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350125 (owner: 10Matthias Mullie) [13:20:20] yeah it's labs only, just need to get it synced to prevent conflicts with other config changes :) [13:20:33] yep [13:21:23] cmjohnson1: good morning! let me know when you're in the DC [13:22:01] !log restart HDFS on analytics100[12] (Hadoop master nodes) to pick up recent topology changes for the cluster [13:22:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:31] (03PS2) 10Marostegui: db-eqiad.php: Repool db1026, depool db1045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350411 (https://phabricator.wikimedia.org/T162539) [13:23:23] !log Deploy alter table db1045 - https://phabricator.wikimedia.org/T162539 https://phabricator.wikimedia.org/T163548 [13:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:48] (03PS1) 10Gehel: elasticsearch - cleanup hiera lookups with default "undef" [puppet] - 10https://gerrit.wikimedia.org/r/350413 [13:24:16] xi0Nox [13:24:18] (03PS1) 10Andrew Bogott: openstack: Temporarily disable self service instance creation and deletion on horizon [puppet] - 10https://gerrit.wikimedia.org/r/350414 [13:24:27] xionox. 
here [13:25:11] (03PS2) 10Andrew Bogott: openstack: Temporarily disable self service instance creation and deletion on horizon [puppet] - 10https://gerrit.wikimedia.org/r/350414 [13:26:14] cmjohnson1: cool, I'll need you for the re-cableing of cr2-asw2 in ~40min [13:27:01] !log Deploy alter table labsdb1001 https://phabricator.wikimedia.org/T162539 https://phabricator.wikimedia.org/T163548 [13:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:10] XioNoX: I am going to move all servers to the same port #'s and descriptions they're currently...2/0/14 will become 8/0/14 and so forth....for d8 to d2 that may be an issue but will confirm [13:27:57] cmjohnson1: thanks, are you doing the switch port configuration as well or should I? [13:28:07] (03PS3) 10Aude: Don't enable tabular-data data type yet on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350397 [13:28:13] (03CR) 10Aude: [V: 032 C: 032] Don't enable tabular-data data type yet on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350397 (owner: 10Aude) [13:29:44] !log Deploy alter table on db1069 (wikidatawiki) https://phabricator.wikimedia.org/T162539 https://phabricator.wikimedia.org/T163548 [13:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:18] aude: did you deploy the test wiki change? [13:30:20] (03Merged) 10jenkins-bot: Don't enable tabular-data data type yet on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350397 (owner: 10Aude) [13:30:32] (03CR) 10jenkins-bot: Don't enable tabular-data data type yet on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350397 (owner: 10Aude) [13:30:40] (03CR) 10Andrew Bogott: [C: 032] openstack: Temporarily disable self service instance creation and deletion on horizon [puppet] - 10https://gerrit.wikimedia.org/r/350414 (owner: 10Andrew Bogott) [13:32:07] Amir1: deploying [13:32:24] Amir1: not sure how best to verify... maybe you can do? 
[13:32:40] yeah [13:32:59] !log aude@naos Synchronized wmf-config/Wikibase-production.php: disable tabular-data for now on wikidata and enable echo notification on test wikis (duration: 01m 06s) [13:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:21] addshore: i see some warnings [13:33:24] Warning: onEnhancedChangesListModifyBlockLineData() expects exactly 4 parameters, 3 given in /srv/mediawiki/php-1.29.0-wmf.21/extensions/Flow/Hooks.php on line 538 [13:33:29] not sure if they are related or what [13:33:41] aude patch cannot merge, should I rebase it? (https://gerrit.wikimedia.org/r/#/c/350125/) [13:33:52] Notice: Undefined variable: classes in /srv/mediawiki/php-1.29.0-wmf.21/extensions/Flow/Hooks.php on line 553 [13:33:57] matthiasmullie: i can do it [13:34:04] aude: did you mean to ping Amir1 ? That shouldn't be anything related to me afaik [13:34:29] addshore: ok [13:34:41] it's something related to enhanced changes [13:34:47] aude: My patch was on wmf20 (also it is already reverted and didn't go out to the cluster) [13:34:54] ok [13:35:03] aude: okay, the thing I just got a stacktrace for is https://phabricator.wikimedia.org/T163874 [13:35:07] i'll make a ticket, if one doesn't exist yet [13:35:44] (03PS3) 10Aude: Turn $wg3dProcessor into an array of arguments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350125 (owner: 10Matthias Mullie) [13:35:54] (03CR) 10Aude: [V: 032 C: 032] Turn $wg3dProcessor into an array of arguments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350125 (owner: 10Matthias Mullie) [13:36:30] all seems possibly related or at least the same area of code [13:36:39] but i don't have time to investigate right now [13:37:35] (03PS1) 10Jcrespo: Add --disable-auto-rehash by defaul on my .my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/350416 [13:37:48] (03Merged) 10jenkins-bot: Turn $wg3dProcessor into an array of arguments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350125 
(owner: 10Matthias Mullie) [13:37:56] (03CR) 10jenkins-bot: Turn $wg3dProcessor into an array of arguments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350125 (owner: 10Matthias Mullie) [13:39:33] Amir1: deployed [13:39:40] Thanks [13:39:43] let me test [13:40:20] !log aude@naos Synchronized wmf-config/CommonSettings-labs.php: (no justification provided) (duration: 00m 57s) [13:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:31] swat is done [13:44:20] <_joe_> !log shutting down mc1013-18 for row D maintenance [13:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:53] 06Operations, 10DBA, 10Wikimedia-General-or-Unknown: Spurious completely empty `image` table row on commonswiki - https://phabricator.wikimedia.org/T155769#3213837 (10Reedy) 05Open>03Resolved a:03Reedy ``` mysql:wikiadmin@db1084 [commonswiki]> select * from page where page_title = ''; Empty set (27.45... [13:47:07] (03CR) 10Jcrespo: [C: 032] Add --disable-auto-rehash by defaul on my .my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/350416 (owner: 10Jcrespo) [13:50:37] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3213910 (10Andrew) We're pretty sure that the only Labs thing affected by this is instance creation. I've disabled instance creation for now, wi... [13:52:26] PROBLEM - IPsec on mc2031 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc1013_v4 [13:52:47] XioNoX: during the row D maintenance, we will also have the "ElasticSearch health check for shards" check on logstash nodes to silence [13:53:02] XioNoX: you want me to take care of those already? 
[13:53:04] !log stop kafka on kafka1020 and kafka1018 for row-d extended maintenance (D2) [13:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:12] D2: Add .arcconfig for differential/arcanist - https://phabricator.wikimedia.org/D2 [13:53:33] <_joe_> ahahahah [13:53:56] yeah, that is annoying [13:53:57] <_joe_> let's use D-2 [13:53:59] gehel: yes please [13:54:03] (03PS3) 10Marostegui: db-eqiad.php: Repool db1026, depool db1045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350411 (https://phabricator.wikimedia.org/T162539) [13:54:12] okok :P [13:54:15] !log downtime "ElasticSearch health check for shards" checks for logstash and elasticsearch eqiad - T148506 [13:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:23] T148506: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506 [13:55:37] XioNoX: done (I did not downtime the individual servers, I'm assuming you / chris will do that when touching them [13:56:19] !log downtime and poweroff ms-be 21 26 27 37 38 39 before switch relocation - T148506 [13:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:36] PROBLEM - IPsec on mc2032 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc1014_v4 [13:56:45] gehel: is it possible to mass downtime servers? [13:56:48] <_joe_> this is expected, sadly ^^ [13:56:52] !log disabled instance creation on Horizon via https://gerrit.wikimedia.org/r/#/c/350414/ and on wikitech via a strategic edit in extensions/OpenStackManager/special/SpecialNovaInstance.php [13:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:59] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1026, depool db1045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350411 (https://phabricator.wikimedia.org/T162539) (owner: 10Marostegui) [13:57:06] XioNoX, just click many times! 
:-) [13:57:06] PROBLEM - IPsec on mc2033 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc1015_v4 [13:57:18] there is a script, it is documented somewhere [13:57:22] or a for loop on "icinga-downtime" [13:57:33] <_joe_> XioNoX: there is the icinga-downtime script on tegmen you can use [13:57:36] PROBLEM - IPsec on mc2034 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc1016_v4 [13:57:41] * gehel is looking for that doc... [13:58:05] !log put labservices1001 into downtime to minimize (but probably not totally eliminate) alert spam [13:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:16] PROBLEM - IPsec on mc2035 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc1017_v4 [13:58:42] XioNoX: you can select multiple ones in the Icinga web UI under "Host detail" and then select "Schedule downtime for selected Host(s)" under "Command for checked hosts(s)" [13:58:59] XioNoX: I'm not finding docs, but as _joe_ said, "icinga-downtime" script on tegmen... 
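Note on the mass-downtime exchange above: the "for loop on icinga-downtime" idea needs a list of hosts first. The sketch below only covers that step, expanding the bracket shorthand seen in this log (cp107[1234], aqs100[69], analytics100[12]) into individual host names. The helper name expand_hosts is hypothetical, and the log does not show the arguments of the icinga-downtime script, so the actual downtime call is deliberately left out rather than guessed.

```python
import re
from itertools import product


def expand_hosts(pattern: str) -> list[str]:
    """Expand bracket shorthand such as 'cp107[1234]' into host names.

    Each [...] group is treated as a set of single characters, which is
    how the shorthand is used in the log (aqs100[69] -> aqs1006, aqs1009).
    This is an assumption about the notation, not a documented format.
    """
    # re.split with a capturing group alternates literal text and
    # bracket contents: 'cp107[1234]' -> ['cp107', '1234', ''].
    parts = re.split(r"\[([^\]]+)\]", pattern)
    # Odd indexes hold bracket contents (one choice per character),
    # even indexes hold literal text (a single choice).
    choices = [list(part) if i % 2 else [part] for i, part in enumerate(parts)]
    return ["".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    for host in expand_hosts("cp107[1234]"):
        # Placeholder: call the downtime tooling for each host here.
        print(host)
```

The printed names could then feed a shell loop or the web UI selection mentioned just after this point; which of the two is preferable is not settled in the log.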
[13:59:09] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1026, depool db1045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350411 (https://phabricator.wikimedia.org/T162539) (owner: 10Marostegui) [13:59:12] thanks [13:59:16] PROBLEM - IPsec on mc2036 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc1018_v4 [13:59:18] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1026, depool db1045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350411 (https://phabricator.wikimedia.org/T162539) (owner: 10Marostegui) [13:59:26] (03PS1) 10Ladsgroup: Set echoIcon for notification of wikibase in test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350418 (https://phabricator.wikimedia.org/T142102) [13:59:47] !log lowered VRRP priority for T148506 [13:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:56] T148506: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506 [14:00:30] !log marostegui@naos Synchronized wmf-config/db-eqiad.php: Repool db1026, depool db1045 - T162539 T163548 (duration: 00m 53s) [14:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:39] T162539: Deploy schema change for adding term_full_entity_id column to wb_terms table - https://phabricator.wikimedia.org/T162539 [14:00:39] T163548: Drop the useless wb_terms keys "wb_terms_entity_type" and "wb_terms_type" on "wb_terms" table - https://phabricator.wikimedia.org/T163548 [14:03:46] PROBLEM - Host cp1072 is DOWN: PING CRITICAL - Packet loss = 100% [14:04:07] !log "cr2-eqiad# set interfaces ae4 disable" done, (1 ping loss) - T148506 [14:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:22] cmjohnson1: you're good to proceed with the recabling anytime, (cf. 
https://phabricator.wikimedia.org/T148506#3205842 for details) [14:05:48] cmjohnson1: by recabling I mean the cr2-asw2 links [14:06:06] (03PS1) 10Giuseppe Lavagetto: role::memcached: swap out mc1013-18, in mc1031-36 [puppet] - 10https://gerrit.wikimedia.org/r/350420 [14:07:19] Can I deploy this now or I should wait until swat? https://gerrit.wikimedia.org/r/350418 Set echoIcon for notification of wikibase in test wikis [14:08:16] PROBLEM - IPsec on cp4007 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp1072_v4, cp1072_v6 [14:08:26] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 68 connecting: cp1072_v4, cp1072_v6 [14:08:36] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp1072_v4, cp1072_v6 [14:09:05] (03CR) 10Giuseppe Lavagetto: [C: 032] role::memcached: swap out mc1013-18, in mc1031-36 [puppet] - 10https://gerrit.wikimedia.org/r/350420 (owner: 10Giuseppe Lavagetto) [14:09:12] (03PS1) 10Alexandros Kosiaris: ores: Add twemproxy support [puppet] - 10https://gerrit.wikimedia.org/r/350421 (https://phabricator.wikimedia.org/T122676) [14:10:06] (03CR) 10jerkins-bot: [V: 04-1] ores: Add twemproxy support [puppet] - 10https://gerrit.wikimedia.org/r/350421 (https://phabricator.wikimedia.org/T122676) (owner: 10Alexandros Kosiaris) [14:14:29] (03PS1) 10Giuseppe Lavagetto: site.pp: move mc1013-18 to spares, mc1031-36 into rotation [puppet] - 10https://gerrit.wikimedia.org/r/350422 (https://phabricator.wikimedia.org/T137345) [14:15:39] (03CR) 10Giuseppe Lavagetto: [C: 032] site.pp: move mc1013-18 to spares, mc1031-36 into rotation [puppet] - 10https://gerrit.wikimedia.org/r/350422 (https://phabricator.wikimedia.org/T137345) (owner: 10Giuseppe Lavagetto) [14:20:04] !log stop zookeeper on conf1003 for row-d maintenance (Hadoop, Kafka related) [14:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:16] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: 
(unnamed), cp2005_v6, cp2008_v6, cp2014_v6, cp2022_v6, cp3035_v6, cp3039_v6, cp3044_v6, cp3046_v6, cp3047_v6, cp3049_v6, cp4005_v6, cp4006_v6, cp4007_v6, cp4013_v6, cp4014_v6, cp4015_v6 not-conn: cp2002_v6, cp2011_v6, cp2017_v6, cp2020_v6, cp2024_v6, cp2026_v6, cp3034_v6, cp3036_v6, cp3037_v6, cp3038_v6, cp3045_v6, cp3048_v6 [14:20:26] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: (unnamed), cp2005_v6, cp2008_v6, cp2014_v6, cp2022_v6, cp2024_v6, cp3035_v6, cp3039_v6, cp3044_v6, cp3047_v6, cp3049_v6, cp4005_v6, cp4006_v6, cp4007_v6, cp4013_v6, cp4014_v6, cp4015_v6 not-conn: cp2002_v6, cp2011_v6, cp2017_v6, cp2020_v6, cp2026_v6, cp3034_v6, cp3036_v6, cp3037_v6, cp3038_v6, cp3045_v6, cp3046_v6, cp3048_v6 [14:20:27] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: (unnamed), cp2008_v6, cp3047_v6, cp4005_v6, cp4006_v6 not-conn: cp2002_v6, cp2005_v6, cp2011_v6, cp2014_v6, cp2017_v6, cp2020_v6, cp2022_v6, cp2024_v6, cp2026_v6, cp3034_v6, cp3035_v6, cp3036_v6, cp3037_v6, cp3038_v6, cp3039_v6, cp3044_v6, cp3045_v6, cp3046_v6, cp3048_v6, cp3049_v6, cp4007_v6, cp4013_v6, cp4014_v6, cp4015_v6 [14:22:16] RECOVERY - IPsec on mc2036 is OK: Strongswan OK - 1 ESP OK [14:26:29] !log depooling aqs100[69] from AQS for network maintenance [14:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:16] PROBLEM - Host cp1071 is DOWN: PING CRITICAL - Packet loss = 100% [14:29:16] RECOVERY - Host cp1071 is UP: PING OK - Packet loss = 0%, RTA = 36.87 ms [14:29:47] PROBLEM - puppet last run on mc1034 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[redis-instance-tcp_6379] [14:30:46] RECOVERY - puppet last run on mc1034 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [14:31:36] RECOVERY - IPsec on mc2032 is OK: Strongswan OK - 1 ESP OK [14:31:36] RECOVERY - IPsec on mc2034 is OK: Strongswan OK - 1 ESP OK [14:32:06] RECOVERY - IPsec on mc2033 is OK: Strongswan OK - 1 ESP OK [14:32:16] RECOVERY - IPsec on mc2035 is OK: Strongswan OK - 1 ESP OK [14:32:26] RECOVERY - IPsec on mc2031 is OK: Strongswan OK - 1 ESP OK [14:34:06] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_geowiki-scripts],Exec[git_pull_analytics.wikimedia.org] [14:34:33] <_joe_> I guess the thorium puppet failure is expected [14:35:55] probably yes [14:47:30] !log bblack@neodymium conftool action : set/pooled=no; selector: dc=eqiad,cluster=cache_upload,name=cp107[1234].eqiad.wmnet [14:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:53] !log Stop MySQL db1070 (just in case) to test drac cold restart [14:48:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:55] Can anyone tell me who maintains the IRC echo bot rc-pmtpa for irc.wikimedia,org? [14:56:09] (03PS1) 10Nemo bis: Make mailman reject messages with high X-Spam-Score by default [puppet] - 10https://gerrit.wikimedia.org/r/350429 (https://phabricator.wikimedia.org/T161082) [14:56:16] 06Operations, 10ops-eqiad, 10DBA: Reset db1070 idrac - https://phabricator.wikimedia.org/T160392#3214117 (10Marostegui) [14:56:20] 06Operations, 10ops-eqiad, 10DBA: Reset db1070 idrac - https://phabricator.wikimedia.org/T160392#3097215 (10Marostegui) Leaving this documented for the future. I tried a cold reset locally, but it doesn't fix the remote issue. ``` root@db1070:~# bmc-device --debug --cold-reset =============================... 
[14:57:16] acagastya: Ops vaguely [14:57:19] Why, is it broken? [14:58:01] (03CR) 10Nuria: [C: 031] Add AutomatedRequest to schema black list [puppet] - 10https://gerrit.wikimedia.org/r/350235 (https://phabricator.wikimedia.org/T67508) (owner: 10Fdans) [14:58:06] On #en.wikinews, the bot notifies for a "Bot: Update Market Data" for a bot edit on es.wikinews [14:58:26] Interesting [14:58:55] More than a one off? [14:59:48] Has been doing for months. [15:00:14] Is that the only one that appears from another wiki? [15:00:28] Yes, it is. [15:00:42] The stock market update. [15:01:11] Can you file a bug in phabricator please? [15:01:17] With a full example irc line etc? [15:01:46] (03CR) 10Alexandros Kosiaris: [C: 04-1] profile::redis::multidc_instance: separate concerns with confd (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/350404 (owner: 10Giuseppe Lavagetto) [15:01:49] Well, I can, but I want to see the source code. [15:01:58] Sure [15:02:03] There's a few levels to it [15:02:08] There's the MW code for it [15:02:52] They live in https://github.com/wikimedia/mediawiki/tree/master/includes/rcfeed [15:05:08] https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org [15:05:14] I will have a look. [15:05:32] just trying to find where the "bot" code lives [15:05:36] Can't remember if it's just ircecho [15:05:59] Reedy ircecho as in what icinga-wm is using? [15:06:31] Reedy https://github.com/wikimedia/puppet/blob/64a94a86acb6e31672b40f2be34be951974422ed/modules/ircecho/files/ircecho [15:06:51] By the way, can you tell me what does pmtpa stands for? 
[15:06:57] indeed [15:07:05] powermedium tampa [15:07:15] it's a name of one of the old florida datacentres [15:08:12] https://github.com/wikimedia/puppet/blob/e959321aa620b77403cc9379db2e86080323c6e8/modules/mw_rc_irc/manifests/irc_echo.pp [15:08:12] https://github.com/wikimedia/puppet/blob/7aa41fed440e9aee61b19e400103ec9f98559b26/modules/profile/manifests/mw_rc_irc.pp [15:09:00] aha, it's udpmxircecho.py [15:09:23] paladox: indeed ircecho is the software powering icinga-wm [15:09:30] Yep [15:09:46] acagastya: The other source code is https://github.com/wikimedia/puppet/blob/production/modules/mw_rc_irc/files/udpmxircecho.py [15:09:50] I can't even remember how many IRC bot implementations we have anymore [15:10:06] maybe they 'll just converge and form skynet at some point [15:10:06] ETOOMANY [15:10:24] I, for one, welcome our new irc bot overlords [15:11:02] s/new/many/ [15:11:52] acagastya: https://wikitech.wikimedia.org/wiki/Pmtpa_cluster [15:12:50] !log switch ports for rack D7 and D8 configured - T148506 [15:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:58] D7: Testing: DO not merge - https://phabricator.wikimedia.org/D7 [15:12:58] D8: Add basic .arclint that will handle pep8 and pylint checks - https://phabricator.wikimedia.org/D8 [15:12:59] T148506: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506 [15:13:12] (03PS2) 10Nemo bis: Make mailman reject messages with high X-Spam-Score by default [puppet] - 10https://gerrit.wikimedia.org/r/350429 (https://phabricator.wikimedia.org/T161082) [15:14:23] PROBLEM - nova-api process on labtestnet2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-api [15:14:33] PROBLEM - nova-api http on labtestnet2001 is CRITICAL: connect to address 10.192.20.5 and port 8774: Connection refused [15:15:23] RECOVERY - nova-api process on labtestnet2001 is OK: PROCS OK: 49 processes with regex args 
^/usr/bin/python /usr/bin/nova-api [15:15:33] RECOVERY - nova-api http on labtestnet2001 is OK: HTTP OK: HTTP/1.1 200 OK - 499 bytes in 0.008 second response time [15:17:32] Lot of links to follow. [15:19:03] Thanks for the help, and swift response! [15:22:23] PROBLEM - nova-api process on labtestnet2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-api [15:22:33] PROBLEM - nova-api http on labtestnet2001 is CRITICAL: connect to address 10.192.20.5 and port 8774: Connection refused [15:23:53] ACKNOWLEDGEMENT - nova-api http on labtestnet2001 is CRITICAL: connect to address 10.192.20.5 and port 8774: Connection refused andrew bogott Im breaking this stuff on purpose! [15:23:53] ACKNOWLEDGEMENT - nova-api process on labtestnet2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-api andrew bogott Im breaking this stuff on purpose! [15:24:31] !log Shutdown es2019 for maintenance with papaul and Dell - T149526 [15:24:34] jynus: ^ [15:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:40] T149526: es2019 crashed again - https://phabricator.wikimedia.org/T149526 [15:25:24] good [15:26:28] PROBLEM - MariaDB disk space on labsdb1001 is CRITICAL: DISK CRITICAL - free space: /srv 187814 MB (5% inode=99%) [15:27:27] known ^ ? 
[15:27:34] expected, rather [15:27:39] could be related to the alter table [15:28:12] I can take a look [15:28:30] it might be i think, the wb_terms table of wikidata [15:29:38] enwiki, 600MB [15:29:50] GB, I mean [15:29:54] pfffff [15:30:04] p50380g50816__pop_stats 530 GB [15:30:15] ah, no [15:30:19] only 53 GB [15:30:23] RECOVERY - nova-api process on labtestnet2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-api [15:30:28] well, not bad either [15:30:43] PROBLEM - Host cp1073 is DOWN: PING CRITICAL - Packet loss = 100% [15:30:43] PROBLEM - Host cp1071 is DOWN: PING CRITICAL - Packet loss = 100% [15:31:03] PROBLEM - Host cp1074 is DOWN: PING CRITICAL - Packet loss = 100% [15:31:09] only 152G left [15:31:17] would that be enough? [15:32:01] in production it is 120G [15:32:07] and the alter has been running for a while now [15:32:09] so I would say yes [15:32:22] at 1% rocudb stops working [15:32:27] percent [15:32:28] ACKNOWLEDGEMENT - Host cp1074 is DOWN: PING CRITICAL - Packet loss = 100% Ema https://phabricator.wikimedia.org/T148506 [15:32:28] ACKNOWLEDGEMENT - Host cp1073 is DOWN: PING CRITICAL - Packet loss = 100% Ema https://phabricator.wikimedia.org/T148506 [15:32:28] ACKNOWLEDGEMENT - Host cp1072 is DOWN: PING CRITICAL - Packet loss = 100% Ema https://phabricator.wikimedia.org/T148506 [15:32:28] ACKNOWLEDGEMENT - Host cp1071 is DOWN: PING CRITICAL - Packet loss = 100% Ema https://phabricator.wikimedia.org/T148506 [15:32:58] mm [15:33:02] the table is on innodb [15:33:08] but are we using file per table there? [15:33:11] yes, I said thta the last time [15:33:23] !log "cr2-eqiad# delete interfaces ae4 disable" done, confirmed links and LACP are up [15:33:23] PROBLEM - nova-api process on labtestnet2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-api [15:33:25] maybe it is better to kill the alter, compress the table, and then run it again? 
[15:33:26] we are using tokudb horrible per table option [15:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:43] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [15:33:44] !log "cr2-eqiad# delete interfaces ae4 disable" done, confirmed links and LACP are up - T148506 [15:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:52] T148506: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506 [15:33:58] is it tokudb really?: ENGINE=InnoDB AUTO_INCREMENT=954208535 DEFAULT CHARSET=binary [15:34:07] no, [15:34:16] I mean that that is the option we use [15:34:19] ah [15:34:46] 204G free now [15:34:48] did you do something? [15:34:52] nope [15:35:07] we should get rid of /srvuserdata [15:35:11] and add it to the lvm [15:35:21] maybe a temporary query? [15:35:24] 06Operations, 10ops-codfw, 10DBA: es2019 crashed again - https://phabricator.wikimedia.org/T149526#3214400 (10Papaul) Hi Papaul, Thank you for contacting Dell EMC Basic Server Support. This mail is with reference to the (Memory and CPU Issue) you had reported on your PowerEdge(R730XD). Please find... [15:35:33] ACKNOWLEDGEMENT - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] andrew bogott Disabled on purpose [15:35:36] now I receive the page [15:35:38] jynus: agreed [15:36:13] should we kill the alter anyways? [15:37:22] it has been running for 2 hours already [15:37:32] yeah [15:37:33] how much did it take elsewhere? [15:37:53] for a 160G slave it took around 12 hours [15:37:58] so given it is labs, let's make it 20 [15:38:28] but if we kill it we will likely kill the server? 
[15:38:50] 2 hours of the table locked and reverting [15:38:54] >2 hours [15:38:56] That is the eternal question when killing a big alter :( [15:39:04] yes, it will take 2 hours at least to roll back [15:39:23] as I said, the table in production is around 120G [15:39:26] so it should be enough for the alter [15:39:33] the issue is the alter + temporary big queries? [15:39:35] 06Operations, 10ops-codfw, 10DBA: es2019 crashed again - https://phabricator.wikimedia.org/T149526#3214415 (10Papaul) Hi Papaul, I will get the motherboard and the memory module replaced at the same time but at the same time would like to request you to help me with the address of the location where you... [15:39:46] do we have a tmp table size? [15:40:09] tmp_table_size | 67108864 [15:40:34] no no [15:40:40] <_joe_> !log shutting down conf1003 T148506 [15:40:40] the wb_terms one [15:40:44] ah [15:40:44] 06Operations, 10ops-codfw, 10DBA: es2019 crashed again - https://phabricator.wikimedia.org/T149526#3214416 (10Marostegui) Thanks Papaul! As per our chat, I have brought MySQL, ping me when you need it down again, [15:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:48] T148506: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506 [15:40:51] 06Operations, 06Labs: During labservices1001 failover fqdn changed from foo.project.eqiad.wmflabs to foo.eqiad.wmflabs - https://phabricator.wikimedia.org/T163823#3214418 (10Andrew) I still can't reproduce this, even creating the exact set of breakages that were present when the issue appeared in prod. My gue... 
[15:41:03] PROBLEM - Host mc1017 is DOWN: PING CRITICAL - Packet loss = 100% [15:41:03] PROBLEM - Host mc1018 is DOWN: PING CRITICAL - Packet loss = 100% [15:41:03] PROBLEM - Host mc1016 is DOWN: PING CRITICAL - Packet loss = 100% [15:41:03] PROBLEM - Host mc1015 is DOWN: PING CRITICAL - Packet loss = 100% [15:41:03] PROBLEM - Host mc1014 is DOWN: PING CRITICAL - Packet loss = 100% [15:41:04] PROBLEM - Host mc1013 is DOWN: PING CRITICAL - Packet loss = 100% [15:41:12] jynus: I was looking for it before but I didn't find it [15:41:18] 06Operations, 10DBA, 10Wikimedia-General-or-Unknown, 13Patch-For-Review: Spurious completely empty `image` table row on commonswiki - https://phabricator.wikimedia.org/T155769#3214419 (10matmarex) \o/ [15:41:24] That is why I am wondering if we use file per table, or how tokudb hides it [15:41:33] RECOVERY - nova-api http on labtestnet2001 is OK: HTTP OK: HTTP/1.1 200 OK - 499 bytes in 0.006 second response time [15:42:23] RECOVERY - nova-api process on labtestnet2001 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/nova-api [15:42:31] <_joe_> uhm the downtime on those hosts just expired [15:42:37] <_joe_> let me input a longer one [15:43:20] (03CR) 10Nuria: [C: 04-1] Add AutomatedRequest to schema black list [puppet] - 10https://gerrit.wikimedia.org/r/350235 (https://phabricator.wikimedia.org/T67508) (owner: 10Fdans) [15:43:51] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [15:44:17] (03CR) 10Nuria: [C: 031] Add AutomatedRequest to schema black list [puppet] - 10https://gerrit.wikimedia.org/r/350235 (https://phabricator.wikimedia.org/T67508) (owner: 10Fdans) [15:46:07] let's leave it running [15:46:12] if it breaks, it breaks [15:46:40] the table there is 3 times smaller than in production (as per show table status, which is not accurate) so it might not be as big [15:47:06] 3 times fewer rows, i meant [15:47:06] that makes no
sense [15:47:21] as in, I believe you [15:47:28] but that is probably not right [15:47:31] might be show table status being super inaccurate as I said [15:51:21] (03PS1) 10Dzahn: Revert "openldap::mgmt: turn cross-validate-accounts into template" [puppet] - 10https://gerrit.wikimedia.org/r/350435 [15:51:39] (03CR) 10Jgreen: "This is far enough out of my usual area that I'm not comfortable signing off on a global mailman config change." [puppet] - 10https://gerrit.wikimedia.org/r/350429 (https://phabricator.wikimedia.org/T161082) (owner: 10Nemo bis) [15:52:03] 06Operations, 06Labs: During labservices1001 failover fqdn changed from foo.project.eqiad.wmflabs to foo.eqiad.wmflabs - https://phabricator.wikimedia.org/T163823#3214432 (10Andrew) a:05Andrew>03None [15:52:23] marostegui, https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=17&fullscreen&orgId=1&var-server=labsdb1001&var-network=eth0&from=now-12h&to=now [15:53:13] PROBLEM - Host ms-be1038 is DOWN: PING CRITICAL - Packet loss = 100% [15:53:13] PROBLEM - Host ms-be1027 is DOWN: PING CRITICAL - Packet loss = 100% [15:53:13] PROBLEM - Host ms-be1021 is DOWN: PING CRITICAL - Packet loss = 100% [15:53:13] PROBLEM - Host ms-be1037 is DOWN: PING CRITICAL - Packet loss = 100% [15:53:13] PROBLEM - Host ms-be1039 is DOWN: PING CRITICAL - Packet loss = 100% [15:53:37] jynus: so it matches the time of the alter [15:54:21] 06Operations, 06Labs, 10wikitech.wikimedia.org: Update wikitech-static and develop procedures to keep it maintained - https://phabricator.wikimedia.org/T163721#3214444 (10Andrew) [15:54:32] so maybe not worth continuing?
[15:54:48] we can try to kill it yes [15:55:16] 06Operations, 06Labs, 10wikitech.wikimedia.org: Update wikitech-static and develop procedures to keep it maintained - https://phabricator.wikimedia.org/T163721#3207227 (10Andrew) [15:55:24] jynus: cross your fingers [15:55:34] wait [15:55:40] waiting [15:56:00] it has plateaued [15:56:03] RECOVERY - Host ms-be1039 is UP: PING WARNING - Packet loss = 64%, RTA = 37.70 ms [15:56:03] RECOVERY - Host cp1071 is UP: PING OK - Packet loss = 0%, RTA = 36.18 ms [15:56:03] RECOVERY - Host cp1072 is UP: PING OK - Packet loss = 0%, RTA = 36.04 ms [15:56:13] RECOVERY - Host cp1073 is UP: PING OK - Packet loss = 0%, RTA = 36.74 ms [15:56:17] https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=17&fullscreen&orgId=1&var-server=labsdb1001&var-network=eth0&from=now-1h&to=now [15:56:37] yes, I have been watching it lately and it is stuck on 133G available [15:56:43] RECOVERY - Host mc1017 is UP: PING OK - Packet loss = 0%, RTA = 36.81 ms [15:56:43] let's wait [15:56:53] and if it starts again, I will kill it [15:57:19] and we need to look for a solution because otherwise once we start using this column in production labs will break replication :_( [15:57:22] oh no [15:57:23] RECOVERY - Host ms-be1037 is UP: PING OK - Packet loss = 0%, RTA = 36.06 ms [15:57:26] replication has failed [15:57:30] what?
[15:57:31] that is why it stopped [15:57:33] RECOVERY - Host ms-be1027 is UP: PING OK - Packet loss = 0%, RTA = 37.00 ms [15:57:33] RECOVERY - Host ms-be1038 is UP: PING OK - Packet loss = 0%, RTA = 36.18 ms [15:57:33] :-) [15:57:37] kill it [15:57:48] Could not execute Write_rows_v1 event on table enwiktionary.pagelinks; Disk full (pagelinks) [15:57:54] great [15:57:54] ok [15:57:57] killing it [15:58:44] done [15:58:47] and space is back [15:58:50] 373G available [15:59:13] ok, restarting replication [15:59:13] RECOVERY - Host mc1015 is UP: PING OK - Packet loss = 0%, RTA = 36.70 ms [15:59:13] RECOVERY - Host mc1013 is UP: PING OK - Packet loss = 0%, RTA = 36.70 ms [15:59:28] RECOVERY - MariaDB disk space on labsdb1001 is OK: DISK OK [15:59:43] RECOVERY - Host mc1014 is UP: PING OK - Packet loss = 0%, RTA = 37.99 ms [15:59:49] I shouldn't have done that for s5 [15:59:56] right? [16:00:01] why not? [16:00:07] (03PS2) 10Dzahn: Planet: Remove not maintained blog from fr feed [puppet] - 10https://gerrit.wikimedia.org/r/350391 (owner: 10Dereckson) [16:00:07] there is no problem in doing it [16:00:15] because it will reapply the schema? [16:00:27] or it was not coming from replication?
[16:00:35] nooo it was being done locally :) [16:00:41] ok ok [16:00:41] so no problem [16:00:42] much easier [16:00:44] RECOVERY - Host mc1016 is UP: PING OK - Packet loss = 0%, RTA = 38.10 ms [16:01:01] if we cannot do that, we definitely cannot do the others [16:01:31] (03CR) 10Dzahn: [C: 032] Planet: Remove not maintained blog from fr feed [puppet] - 10https://gerrit.wikimedia.org/r/350391 (owner: 10Dereckson) [16:01:48] yes, it is something we need to fix [16:01:58] because otherwise we cannot do big alters [16:02:06] we can try to merge srvuserdata [16:02:10] and that would give us plenty of room [16:02:45] (03PS2) 10Dzahn: Revert "openldap::mgmt: turn cross-validate-accounts into template" [puppet] - 10https://gerrit.wikimedia.org/r/350435 [16:02:49] jouncebot: now [16:02:49] No deployments scheduled for the next 1 hour(s) and 57 minute(s) [16:02:53] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: (unnamed), cp2002_v6, cp2005_v6, cp2008_v6, cp2011_v6, cp2014_v6, cp2017_v6, cp2020_v6, cp2022_v6, cp2024_v6, cp2026_v6, cp3034_v6, cp3035_v6, cp3036_v6, cp3037_v6, cp3038_v6, cp3039_v6, cp3044_v6, cp3045_v6, cp3046_v6, cp3047_v6, cp3048_v6, cp3049_v6, cp4005_v6, cp4006_v6, cp4007_v6, cp4013_v6, cp4014_v6, cp4015_v6 [16:03:48] oh yeah 16:00 UTC was the *former* morning SWAT window time [16:04:06] dammit I use pwstore so seldom that I have to relearn how to decrypt a password every time [16:04:21] andrewbogott: pws ed [16:04:34] that errors out for me [16:05:04] oh, maybe I know what's happening [16:05:54] it cant find your private key ? [16:07:17] <_joe_> !log disabled and masked strongswan, memcached, redis on mc1013-17 for decommissioning [16:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:32] mutante: no, I was missing /Users/andrew/.pws-trusted-users [16:08:30] woo now it says [16:08:32] https://www.irccloud.com/pastebin/2mZAjYwN/ [16:11:45] mutante: any ideas? 
[16:12:49] the upstream source for pws seems to be gone now [16:13:55] moritzm: have a minute to help me get pws working? It's safe to assume that I've broken everything that it is possible to break (due to a deletion and partial restore of my homedir) [16:14:04] If there are docs about how to install things from scratch, I can't find them :( [16:14:46] andrewbogott: hmm.. first idea would be to delete it and pull it from scratch [16:14:57] !log stop and mask cassandra and restbase on restbase-dev1003 for row-d maintenance [16:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:10] can't pull it because fatal: repository 'https://code.google.com/p/pwstore/' not found [16:15:30] andrewbogott: the docs are on office wiki https://office.wikimedia.org/wiki/Pwstore#Checking_out_the_pwstore_repository [16:15:42] ah, ok, I was looking on wikitech [16:15:45] andrewbogott: try deleting and git pull it from scratch from neodymium [16:15:54] that I did already, no change [16:16:30] andrewbogott: how do the permissions look on wikitech-static? is it actually not readable? [16:16:35] they're fine, I can cat [16:17:03] i wonder how you got the "code.google.com" part in there [16:17:17] try deleting the actual "pws" program and re-download from https://github.com/weaselp/pwstore/ ? [16:17:19] that's for the pws tool itself [16:17:22] not the pwstore [16:17:29] yea, try using that github source [16:17:49] ah, there it is. Too many different tools called pwstore [16:17:53] RECOVERY - Host ms-be1021 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms [16:17:53] aha [16:18:04] mutante: same error with the new checkout [16:18:31] The "probably not readable".. ? uhm... 
[16:18:34] (03CR) 10Jgreen: "Also see https://phabricator.wikimedia.org/T58525 which has a discussion of why we've historically handled this on a per-list basis instea" [puppet] - 10https://gerrit.wikimedia.org/r/350429 (https://phabricator.wikimedia.org/T161082) (owner: 10Nemo bis) [16:18:35] yeah [16:18:45] probably something messed up with my keys, or my trusted-users... [16:18:53] PROBLEM - swift eqiad-prod container availability on graphite1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [88.0] [16:18:56] I feel like I should be able to just do 'gpg --decrypt' as well but that also fails [16:19:18] abartov: it's on office wiki: https://office.wikimedia.org/wiki/Pwstore [16:19:18] andrewbogott: ah, interesting, yea, that's a good test, let's just use gpg and forget pwstore for a moment then [16:19:29] ah, Daniel already posted that :-) [16:19:50] andrewbogott: so your home dir was restored? maybe you have to run gpg --import [16:19:56] gpg --import [16:20:00] ok, will try [16:20:10] with the pass to the keychain [16:20:10] let me make sure my keys restored. They're in ~/.gpg? [16:20:20] by default, but they don't have to [16:21:43] gpg --import ~/.gpg/secring.gpg i'd try it ... but i still find "probably not readable" a weird message. it's not "cant find key" [16:22:22] ok, in my case it seems to be ~/.gnupg [16:22:26] which, that dir is present at least [16:22:36] sorry, yea, .gnupg [16:22:48] andrewbogott: when you have the checkout and if your cwd is in the pwstore dir, try "pws update-keyring", it should fetch all the keys as listed in .users [16:22:53] do you have a secring.gpg and pubring.gpg ? [16:23:14] secring.gpg is empty. Let me try a restore of that dir [16:24:05] i had to reinstall recently and i copied my secring.gpg back from another place so it wasn't in the default location.. and yea, i did the --import thing [16:24:20] ok!
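
(Editor's sketch, not part of the log.) The empty secring.gpg found above suggests a quick sanity check after restoring a home directory: list the zero-byte files in the keyring directory before blaming gpg or pws. The function name `empty_files` is made up for this example, and ~/.gnupg is only the default location mentioned in the channel; an empty file is not always a fault, so treat the result as a hint.

```python
from pathlib import Path


def empty_files(directory):
    """Return the sorted names of zero-byte regular files directly inside directory."""
    root = Path(directory).expanduser()
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_file() and entry.stat().st_size == 0
    )


if __name__ == "__main__":
    gnupg_dir = Path("~/.gnupg").expanduser()
    if gnupg_dir.is_dir():
        for name in empty_files(gnupg_dir):
            print("empty after restore?", name)
```

The check is deliberately shallow (no recursion into subdirectories), which keeps it safe to run against a directory that holds private keys: it only reads file sizes, never contents.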
[16:24:22] everything works now [16:24:23] (03PS1) 10DatGuy: Enable local uploads on knwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350439 (https://phabricator.wikimedia.org/T133137) [16:24:24] :) [16:24:28] I just had a bad/partial restore of ~/.gnupg [16:24:37] it had all the files but some were empty for no apparent reason [16:24:52] ok, cool [16:24:54] now I just have to remember why I cared... [16:25:30] (03PS1) 10Paladox: Test: DO NOT MERGE [debs/gerrit] - 10https://gerrit.wikimedia.org/r/350440 [16:25:53] (03CR) 10Paladox: [C: 04-1] Test: DO NOT MERGE [debs/gerrit] - 10https://gerrit.wikimedia.org/r/350440 (owner: 10Paladox) [16:26:58] well, shit, this password doesn't work [16:27:19] mutante: do you mind double-checking my work? See if you can get a shell on wikitech-static? [16:27:53] (looks like you touched it last) [16:28:53] (03CR) 10Nemo bis: "I know that discussion, but we've not had any real progress in years so we'd better start somewhere." [puppet] - 10https://gerrit.wikimedia.org/r/350429 (https://phabricator.wikimedia.org/T161082) (owner: 10Nemo bis) [16:30:49] (03PS1) 10Dereckson: Disable collectionsaveascommunitypage right on es.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350442 (https://phabricator.wikimedia.org/T163767) [16:32:57] (03PS2) 10Paladox: Test: DO NOT MERGE [debs/gerrit] - 10https://gerrit.wikimedia.org/r/350440 [16:33:06] (03CR) 10Paladox: [C: 04-1] Test: DO NOT MERGE [debs/gerrit] - 10https://gerrit.wikimedia.org/r/350440 (owner: 10Paladox) [16:33:36] I should have set that to draft. Have no idea how to do that on the command line.
[16:33:46] PROBLEM - swift eqiad-prod object availability on graphite1001 is CRITICAL: CRITICAL: 26.67% of data under the critical threshold [90.0] [16:34:13] 06Operations, 06Labs, 10wikitech.wikimedia.org: Update wikitech-static and develop procedures to keep it maintained - https://phabricator.wikimedia.org/T163721#3214674 (10Andrew) Maintenance tasks are just: - Keep packages up to date - Keep MW up to date [16:34:35] !log mobrovac@naos Started restart [citoid/deploy@b8c4cb2]: Restart for ICU lib update [16:34:36] <_joe_> paladox: git review -D [16:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:57] * paladox doesn't use git review, i use plain git :) [16:35:17] <_joe_> paladox: heh then no idea, I'd have to check git-review sources or the gerrit docs [16:35:27] i will check gerrit docs. [16:36:00] ah [16:36:01] !log mobrovac@naos Started restart [cxserver/deploy@6899032]: Restart for ICU lib update [16:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:20] replace refs/for/master with refs/drafts/master but too late now. Will have to remember that in future.
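
(Editor's sketch, not part of the log.) The refs/for versus refs/drafts swap described above can be written down as a tiny helper that only builds the plain-git command line and does not run it. The helper name `gerrit_push_command` and the remote name "origin" are assumptions for illustration; whether the drafts namespace is accepted at all depends on the Gerrit version running on the server.

```python
def gerrit_push_command(branch="master", draft=False, remote="origin"):
    """Build the argv for pushing HEAD to Gerrit for review.

    With draft=True the change goes to refs/drafts/<branch> instead of
    refs/for/<branch>, as described in the channel.
    """
    namespace = "drafts" if draft else "for"
    return ["git", "push", remote, "HEAD:refs/{}/{}".format(namespace, branch)]


if __name__ == "__main__":
    # Print the command rather than executing it, so nothing is pushed by accident.
    print(" ".join(gerrit_push_command(draft=True)))
```

Returning an argv list rather than a shell string keeps the helper easy to hand to `subprocess.run` later without quoting problems.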
[16:37:03] !log mobrovac@naos Started restart [electron-render/deploy@9156760]: Restart for ICU lib update [16:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:28] !log mobrovac@naos Started restart [eventstreams/deploy@05bcc8f]: Restart for ICU lib update [16:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:16] !log mobrovac@naos Started restart [graphoid/deploy@128206b]: Restart for ICU lib update [16:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:26] PROBLEM - pdfrender on scb1004 is CRITICAL: connect to address 10.64.48.29 and port 5252: Connection refused [16:42:23] on it ^ [16:43:10] (03PS12) 10Paladox: Gerrit: Use the mariadb plugin instead of mysql [puppet] - 10https://gerrit.wikimedia.org/r/336003 (https://phabricator.wikimedia.org/T145885) [16:43:40] !log mobrovac@naos Started restart [electron-render/deploy@9156760]: Restart for ICU lib update [16:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:28] (03CR) 10Muehlenhoff: [C: 031] Revert "openldap::mgmt: turn cross-validate-accounts into template" [puppet] - 10https://gerrit.wikimedia.org/r/350435 (owner: 10Dzahn) [16:44:52] (03PS2) 10Giuseppe Lavagetto: profile::redis::multidc_instance: separate concerns with confd [puppet] - 10https://gerrit.wikimedia.org/r/350404 [16:45:46] PROBLEM - pdfrender on scb2002 is CRITICAL: connect to address 10.192.48.43 and port 5252: Connection refused [16:45:46] PROBLEM - pdfrender on scb1002 is CRITICAL: connect to address 10.64.16.21 and port 5252: Connection refused [16:47:43] (03CR) 10Reedy: "Won't the normal mysql driver work fine with mariadb? 
(I'm fairly sure on my mix of dev machines for other languages, I have the normal my" [puppet] - 10https://gerrit.wikimedia.org/r/336003 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [16:47:46] RECOVERY - swift eqiad-prod container availability on graphite1001 is OK: OK: Less than 1.00% under the threshold [92.0] [16:47:46] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.075 second response time [16:49:02] (03CR) 10Paladox: "> Won't the normal mysql driver work fine with mariadb? (I'm fairly" [puppet] - 10https://gerrit.wikimedia.org/r/336003 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [16:51:26] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.080 second response time [16:51:39] (03CR) 10Reedy: "If Jaime has said he's going to package it at a later date... Shouldn't this be abandoned?" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/336002 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [16:52:07] (03CR) 10Paladox: "Im using this locally + gerrit 2.14 now packages this lib so it's like mysql now." [debs/gerrit] - 10https://gerrit.wikimedia.org/r/336002 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [16:55:46] RECOVERY - pdfrender on scb2002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [16:56:28] andrewbogott: sorry, i was afk. 
i just tried, and no, i can't get a shell on wikitech-static [16:56:58] PROBLEM - Host analytics1037 is DOWN: PING CRITICAL - Packet loss = 100% [16:56:58] PROBLEM - Host analytics1043 is DOWN: PING CRITICAL - Packet loss = 100% [16:56:58] PROBLEM - Host restbase-dev1003 is DOWN: PING CRITICAL - Packet loss = 100% [16:56:58] PROBLEM - Host analytics1067 is DOWN: PING CRITICAL - Packet loss = 100% [16:56:58] PROBLEM - Host analytics1044 is DOWN: PING CRITICAL - Packet loss = 100% [16:56:59] PROBLEM - Host analytics1045 is DOWN: PING CRITICAL - Packet loss = 100% [16:56:59] PROBLEM - Host analytics1068 is DOWN: PING CRITICAL - Packet loss = 100% [16:57:00] PROBLEM - Host kafka1020 is DOWN: PING CRITICAL - Packet loss = 100% [16:57:00] PROBLEM - Host analytics1002 is DOWN: PING CRITICAL - Packet loss = 100% [16:57:01] PROBLEM - Host analytics1036 is DOWN: PING CRITICAL - Packet loss = 100% [16:57:01] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1013 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] [16:57:02] PROBLEM - MariaDB Slave IO: s4 on db1091 is CRITICAL: CRITICAL slave_io_state could not connect [16:57:02] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1012 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] [16:57:03] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1014 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] [16:57:03] PROBLEM - MariaDB Slave IO: s4 on db1040 is CRITICAL: CRITICAL slave_io_state could not connect [16:57:04] PROBLEM - Check whether ferm is active by checking the default input chain on es1019 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [16:57:04] PROBLEM - mysqld processes on db1040 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [16:57:05] PROBLEM - Check systemd state on es1019 is CRITICAL: CRITICAL - degraded: The system is operational but one or 
more units failed. [16:57:07] mutante: does that mean there was a transcription error in the password store and we can never access that machine ever again? [16:57:10] PROBLEM - mysqld processes on es1019 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [16:57:10] PROBLEM - MariaDB Slave IO: s7 on db1094 is CRITICAL: CRITICAL slave_io_state could not connect [16:57:10] PROBLEM - MariaDB Slave Lag: es3 on es1019 is CRITICAL: CRITICAL slave_sql_lag could not connect [16:57:10] PROBLEM - MariaDB Slave IO: es3 on es1019 is CRITICAL: CRITICAL slave_io_state could not connect [16:57:10] PROBLEM - MariaDB Slave SQL: es3 on es1019 is CRITICAL: CRITICAL slave_sql_state could not connect [16:57:16] PROBLEM - mysqld processes on db1094 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [16:57:16] PROBLEM - MariaDB Slave IO: s5 on db1092 is CRITICAL: CRITICAL slave_io_state could not connect [16:57:16] PROBLEM - Host analytics1035 is DOWN: PING CRITICAL - Packet loss = 100% [16:57:16] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error Table strategyappswiki.watchlist doesnt exist on query. Default database: strategyappswiki. [Query snipped] [16:57:16] PROBLEM - puppet last run on analytics1040 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[hadoop-yarn-nodemanager] [16:57:17] PROBLEM - Host analytics1042 is DOWN: PING CRITICAL - Packet loss = 100% [16:57:17] PROBLEM - Host kafka1018 is DOWN: PING CRITICAL - Packet loss = 100% [16:57:18] what the hell [16:57:24] all downtimes expired? [16:57:26] PROBLEM - puppet last run on analytics1039 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 26 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[hadoop-yarn-nodemanager] [16:57:26] PROBLEM - Hadoop NodeManager on analytics1039 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [16:57:28] PROBLEM - MariaDB Slave Lag: s4 on db1091 is CRITICAL: CRITICAL slave_sql_lag could not connect [16:57:33] PROBLEM - mysqld processes on db1093 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [16:57:36] seems so [16:57:39] PROBLEM - mysqld processes on db1091 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [16:57:39] PROBLEM - MariaDB Slave SQL: s4 on db1091 is CRITICAL: CRITICAL slave_sql_state could not connect [16:57:39] PROBLEM - MariaDB Slave IO: s6 on db1093 is CRITICAL: CRITICAL slave_io_state could not connect [16:57:44] those are coming back from downtime :( [16:57:45] PROBLEM - Zookeeper Server on conf1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg [16:57:50] PROBLEM - mysqld processes on db1092 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [16:57:50] PROBLEM - MariaDB Slave Lag: s4 on db1040 is CRITICAL: CRITICAL slave_sql_lag could not connect [16:57:50] PROBLEM - MariaDB Slave SQL: s7 on db1094 is CRITICAL: CRITICAL slave_sql_state could not connect [16:57:50] PROBLEM - MariaDB Slave SQL: s5 on db1092 is CRITICAL: CRITICAL slave_sql_state could not connect [16:57:50] PROBLEM - MariaDB Slave Lag: s5 on db1092 is CRITICAL: CRITICAL slave_sql_lag could not connect [16:57:51] PROBLEM - MariaDB Slave SQL: s6 on db1093 is CRITICAL: CRITICAL slave_sql_state could not connect [16:57:51] PROBLEM - MariaDB Slave Lag: s6 on db1093 is CRITICAL: CRITICAL slave_sql_lag could not connect [16:57:52] PROBLEM - MariaDB Slave Lag: s7 on db1094 is CRITICAL: CRITICAL slave_sql_lag could not connect [16:58:00] PROBLEM - MariaDB Slave SQL: s4 on db1040 is 
CRITICAL: CRITICAL slave_sql_state could not connect [16:58:00] PROBLEM - puppet last run on analytics1041 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[hadoop-yarn-nodemanager] [16:58:00] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1022 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] [16:58:01] PROBLEM - Hadoop NodeManager on analytics1040 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [16:58:02] PROBLEM - Hadoop NodeManager on analytics1041 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [16:58:10] PROBLEM - puppet last run on conf1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 19 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[zookeeper] [16:58:10] PROBLEM - Hadoop NodeManager on analytics1038 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [16:58:12] PROBLEM - puppet last run on analytics1038 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 21 seconds ago with 1 failures. Failed resources (up to 3 shown): Service[hadoop-yarn-nodemanager] [16:58:14] hmmm [16:58:24] those are not D2 [16:58:25] D2: Add .arcconfig for differential/arcanist - https://phabricator.wikimedia.org/D2 [16:58:50] andrewbogott: i don't know. or that the root password was changed or password auth was disabled in favor of keys [16:59:13] ottomata: I silenced all the kafka nodes, 1002,etc.. 
not sure why they are alarming [16:59:29] andrewbogott: if all else fails i guess there is some way to boot into rescue mode [16:59:30] RECOVERY - MariaDB Slave IO: s5 on db1092 is OK: OK slave_io_state Slave_IO_Running: Yes [16:59:46] RECOVERY - mysqld processes on db1091 is OK: PROCS OK: 1 process with command name mysqld [16:59:49] elukey: those are not in your list i think [16:59:51] the db pages are also row d related? marostegui jynus ? [16:59:52] RECOVERY - mysqld processes on db1092 is OK: PROCS OK: 1 process with command name mysqld [16:59:52] RECOVERY - MariaDB Slave SQL: s5 on db1092 is OK: OK slave_sql_state Slave_SQL_Running: Yes [16:59:58] not rack D-2 [17:00:06] godog: I am getting them up [17:00:12] RECOVERY - MariaDB Slave IO: s4 on db1091 is OK: OK slave_io_state Slave_IO_Running: Yes [17:00:25] marostegui: ah ok, thanks! [17:00:45] there are no more downtimes [17:00:46] !log mobrovac@naos Started restart [mathoid/deploy@7eb4092]: Restart for ICU lib update [17:00:48] RECOVERY - mysqld processes on db1093 is OK: PROCS OK: 1 process with command name mysqld [17:00:49] RECOVERY - MariaDB Slave SQL: s4 on db1091 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:00:49] RECOVERY - MariaDB Slave IO: s6 on db1093 is OK: OK slave_io_state Slave_IO_Running: Yes [17:00:49] RECOVERY - MariaDB Slave SQL: s6 on db1093 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:00:49] I checked in icinga [17:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:02] elukey: i'm confused, e.g. 
analytics1038 [17:01:10] OHHHH [17:01:11] d4 [17:01:12] ok ok [17:01:19] this is D4 short downtime [17:01:19] D4: iscap 'project level' commands - https://phabricator.wikimedia.org/D4 [17:01:19] phew [17:01:20] ok [17:01:21] !log mobrovac@naos Started restart [mobileapps/deploy@5c2b9a9]: Restart for ICU lib update [17:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:32] RECOVERY - MariaDB Slave IO: es3 on es1019 is OK: OK slave_io_state Slave_IO_Running: Yes [17:01:39] it's just the icinga downtime expiring? [17:01:48] RECOVERY - mysqld processes on es1019 is OK: PROCS OK: 1 process with command name mysqld [17:01:48] RECOVERY - MariaDB Slave SQL: es3 on es1019 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:01:59] ottomata: well D2 seems also down, and with no downtime [17:02:05] and IIRC I've set two days [17:02:05] !log mobrovac@naos Started restart [trending-edits/deploy@7112062]: Restart for ICU lib update [17:02:07] grrrr [17:02:07] hm [17:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:28] From the DB side it is all downtimes expiring [17:02:34] andrewbogott: https://support.rackspace.com/how-to/reset-your-server-password/ [17:03:07] I have disabled some alerts [17:03:12] I will be checking s3 [17:03:20] watchlist error on labs I think it was [17:03:38] dbstore1001 [17:03:38] yes [17:03:40] that i will fix it [17:03:48] you do? [17:04:01] I was going to check it [17:04:10] it is the watchlist thingy, so expected [17:04:29] ottomata: going to ack and add downtime for those hosts [17:05:32] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:05:33] mutante: thank you, I'll investigate [17:06:09] I am pretty sure I downtimed all the hosts a lot more than 3 hours...
[17:06:10] 06Operations, 06Labs, 10wikitech.wikimedia.org: Update wikitech-static and develop procedures to keep it maintained - https://phabricator.wikimedia.org/T163721#3214809 (10Andrew) The root password to this host no longer works. Either someone fancied it up to use keys, or we need to do a rescue as per https... [17:08:01] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review, 15User-fgiunchedi: Implement DC-local cache failure limiter in Thumbor - https://phabricator.wikimedia.org/T151065#3214814 (10fgiunchedi) [17:09:33] marostegui: me too [17:10:09] elukey: yes, jynus and me are discussing that on databases, that some hosts look so on -databases [17:10:15] maybe we have lost downtimes? [17:12:22] RECOVERY - Host restbase-dev1003 is UP: PING OK - Packet loss = 0%, RTA = 36.24 ms [17:12:42] RECOVERY - swift eqiad-prod object availability on graphite1001 is OK: OK: Less than 1.00% under the threshold [95.0] [17:14:42] PROBLEM - cassandra-a CQL 10.64.48.117:9042 on restbase-dev1003 is CRITICAL: connect to address 10.64.48.117 and port 9042: Connection refused [17:14:42] PROBLEM - restbase endpoints health on restbase-dev1003 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.46, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f29cfb40950: Failed to establish a new connection: [Errno 111] Connection refused,)) [17:14:43] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[hadoop-hdfs-zkfc-init] [17:15:01] (03CR) 10Mobrovac: [C: 031] "We're GTG on this now" [puppet] - 10https://gerrit.wikimedia.org/r/345827 (https://phabricator.wikimedia.org/T159615) (owner: 10Alexandros Kosiaris) [17:15:12] PROBLEM - cassandra-b SSL 10.64.48.118:7001 on restbase-dev1003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [17:15:12] PROBLEM - cassandra-a SSL 10.64.48.117:7001 on restbase-dev1003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [17:15:12] PROBLEM - cassandra-b CQL 10.64.48.118:9042 on restbase-dev1003 is CRITICAL: connect to address 10.64.48.118 and port 9042: Connection refused [17:15:12] PROBLEM - Restbase root url on restbase-dev1003 is CRITICAL: connect to address 10.64.48.46 and port 7231: Connection refused [17:15:13] PROBLEM - cassandra-a service on restbase-dev1003 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive [17:15:22] PROBLEM - cassandra-b service on restbase-dev1003 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive [17:16:29] urandom: --^ [17:16:32] PROBLEM - puppet last run on restbase-dev1003 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Service[cassandra-b],Service[cassandra-a],Service[restbase] [17:21:40] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:21:41] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:21:41] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:24:15] (03CR) 10Dzahn: [C: 032] Revert "openldap::mgmt: turn cross-validate-accounts into template" [puppet] - 10https://gerrit.wikimedia.org/r/350435 (owner: 10Dzahn) [17:28:26] (03CR) 10Dzahn: "done. 
on wasat it's using eqiad as it did before" [puppet] - 10https://gerrit.wikimedia.org/r/350435 (owner: 10Dzahn) [17:38:59] (03PS1) 10EBernhardson: Configure multimedia search template boosting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350452 (https://phabricator.wikimedia.org/T163223) [17:41:23] !log reedy@naos Synchronized wmf-config/InitialiseSettings.php: touch (duration: 01m 23s) [17:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:17] (03Draft1) 10Paladox: Gerrit: Fix bot so we make sure $uploader-username exists [puppet] - 10https://gerrit.wikimedia.org/r/350451 [17:43:19] (03PS2) 10Paladox: Gerrit: Fix bot since i forgot to rename a variable [puppet] - 10https://gerrit.wikimedia.org/r/350451 (https://phabricator.wikimedia.org/T161525) [17:43:43] (03PS3) 10Paladox: Gerrit: Fix bot since i forgot to rename a variable [puppet] - 10https://gerrit.wikimedia.org/r/350451 (https://phabricator.wikimedia.org/T161525) [17:44:41] !log unmasking and starting daemons on restbase-dev1003 [17:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:10] RECOVERY - cassandra-a service on restbase-dev1003 is OK: OK - cassandra-a is active [17:45:20] RECOVERY - cassandra-b service on restbase-dev1003 is OK: OK - cassandra-b is active [17:45:40] RECOVERY - puppet last run on restbase-dev1003 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [17:45:50] RECOVERY - restbase endpoints health on restbase-dev1003 is OK: All endpoints are healthy [17:46:00] RECOVERY - cassandra-a CQL 10.64.48.117:9042 on restbase-dev1003 is OK: TCP OK - 0.036 second response time on 10.64.48.117 port 9042 [17:46:10] RECOVERY - cassandra-b CQL 10.64.48.118:9042 on restbase-dev1003 is OK: TCP OK - 0.036 second response time on 10.64.48.118 port 9042 [17:46:10] RECOVERY - Restbase root url on restbase-dev1003 is OK: HTTP OK: HTTP/1.1 200 - 15500 bytes in 0.086 second response time [17:46:11] 
RECOVERY - cassandra-a SSL 10.64.48.117:7001 on restbase-dev1003 is OK: SSL OK - Certificate restbase-dev1003-a valid until 2018-01-05 22:53:09 +0000 (expires in 254 days) [17:46:11] RECOVERY - cassandra-b SSL 10.64.48.118:7001 on restbase-dev1003 is OK: SSL OK - Certificate restbase-dev1003-b valid until 2018-01-05 22:53:10 +0000 (expires in 254 days) [17:46:55] !log restart nutcracker on the eqiad mw hosts to pick up the new shard config (spamming elasticsearch memcached and triggering alarms) [17:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:55] (03CR) 10Paladox: "Im not 100% sure this will fix it. As i can't reproduce the problem it makes it harder on fixing the problem." [puppet] - 10https://gerrit.wikimedia.org/r/350451 (https://phabricator.wikimedia.org/T161525) (owner: 10Paladox) [17:48:03] elukey: late to the party (was at lunch), but it looks like everything is OK now [17:49:52] !log rebooting es1019 for upgrading and to fix race condition on services [17:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:54] (03PS1) 10EBernhardson: Adjust sistersearch against wikivoyage to require title matching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350456 (https://phabricator.wikimedia.org/T163547) [17:54:20] 06Operations, 10Traffic, 13Patch-For-Review, 05Prometheus-metrics-monitoring, 15User-fgiunchedi: Prometheus varnish metric churn due to VCL reloads - https://phabricator.wikimedia.org/T150479#3215013 (10fgiunchedi) [17:55:54] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2724967 (10ema) I currently can't SSH into any of the following hosts: cp1071, cp1072, cp1073 and cp1074. Presumably, this is due to today's main... 
[17:58:47] (03PS1) 10Jdlrobson: Workaround issue of overriding whitelist config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350459 (https://phabricator.wikimedia.org/T163114) [18:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170426T1800). [18:00:04] Jamesofur, DatGuy, and ebernhardson: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:01:24] \o [18:04:49] I can SWAT [18:05:55] DatGuy: Jamesofur ping me when you're around [18:06:13] here [18:06:20] ping thcipriani [18:06:36] thcipriani: here now :) [18:06:42] hello :) [18:08:09] Jamesofur: haven't updated SecurePoll before afaic recall, do you need me to run some scripts post-sync? Or does this just need to be merged and deployed? [18:09:20] thcipriani: for these ones you can just merge and deploy. 
They're all maintenance scripts (which I can run myself from terbium) with no messages etc [18:09:43] Jamesofur: okie doke, sounds good :) [18:10:37] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350439 (https://phabricator.wikimedia.org/T133137) (owner: 10DatGuy) [18:11:20] PROBLEM - Varnish HTTP upload-frontend - port 3121 on cp2014 is CRITICAL: connect to address 10.192.32.113 and port 3121: Connection refused [18:11:20] PROBLEM - Varnish HTTP upload-frontend - port 3127 on cp2014 is CRITICAL: connect to address 10.192.32.113 and port 3127: Connection refused [18:11:30] PROBLEM - Varnish HTTP upload-frontend - port 3123 on cp2014 is CRITICAL: connect to address 10.192.32.113 and port 3123: Connection refused [18:11:40] PROBLEM - Varnish HTTP upload-frontend - port 3124 on cp2014 is CRITICAL: connect to address 10.192.32.113 and port 3124: Connection refused [18:12:00] PROBLEM - Varnish HTTP upload-frontend - port 3125 on cp2014 is CRITICAL: connect to address 10.192.32.113 and port 3125: Connection refused [18:12:00] PROBLEM - Varnish HTTP upload-frontend - port 80 on cp2014 is CRITICAL: connect to address 10.192.32.113 and port 80: Connection refused [18:12:10] if anyone has time while waiting for their swat am having some issues in https://phabricator.wikimedia.org/T163114 with a swat I did Monday [18:12:10] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp2014 is CRITICAL: connect to address 10.192.32.113 and port 3120: Connection refused [18:12:10] PROBLEM - Varnish HTTP upload-frontend - port 3126 on cp2014 is CRITICAL: connect to address 10.192.32.113 and port 3126: Connection refused [18:12:10] PROBLEM - Varnish HTTP upload-frontend - port 3122 on cp2014 is CRITICAL: connect to address 10.192.32.113 and port 3122: Connection refused [18:12:10] PROBLEM - Check systemd state on cp2014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[18:12:15] that i'm trying to get to the bottom of [18:12:48] thcipriani: if you can touch InitialiseSettings and CommonSettings as part of your deploys it would rule out https://phabricator.wikimedia.org/T126306 being the problem... [18:12:50] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [18:13:44] thcipriani, tell me once it's on mwdebug1002 [18:13:51] ping* :P [18:13:58] (03Merged) 10jenkins-bot: Enable local uploads on knwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350439 (https://phabricator.wikimedia.org/T133137) (owner: 10DatGuy) [18:14:12] jdlrobson: I'll touch them, FWIW it looks like CS was last touched April 25th an 00UTC and IS was touched today at 17UTC [18:14:21] a quick spot check [18:14:33] thanks thcipriani so probably wont help much :/ [18:14:44] but we'll see [18:14:44] !log running alter table on all wikis of s3 T163912 [18:14:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:51] T163912: Convert unique keys into primary keys for some wiki tables on s3 - https://phabricator.wikimedia.org/T163912 [18:14:59] jdlrobson: yeah, won't hurt anything [18:15:52] (03CR) 10jenkins-bot: Enable local uploads on knwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350439 (https://phabricator.wikimedia.org/T133137) (owner: 10DatGuy) [18:16:00] RECOVERY - Varnish HTTP upload-frontend - port 3125 on cp2014 is OK: HTTP OK: HTTP/1.1 200 OK - 455 bytes in 0.000 second response time [18:16:00] RECOVERY - Varnish HTTP upload-frontend - port 80 on cp2014 is OK: HTTP OK: HTTP/1.1 200 OK - 455 bytes in 0.004 second response time [18:16:01] !log start varnish-frontend on cp2014 [18:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:10] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp2014 is OK: HTTP OK: HTTP/1.1 200 OK - 455 bytes in 0.001 second response time [18:16:10] RECOVERY - Varnish 
HTTP upload-frontend - port 3126 on cp2014 is OK: HTTP OK: HTTP/1.1 200 OK - 455 bytes in 0.004 second response time [18:16:10] RECOVERY - Varnish HTTP upload-frontend - port 3122 on cp2014 is OK: HTTP OK: HTTP/1.1 200 OK - 455 bytes in 0.003 second response time [18:16:11] RECOVERY - Check systemd state on cp2014 is OK: OK - running: The system is fully operational [18:16:20] RECOVERY - Varnish HTTP upload-frontend - port 3121 on cp2014 is OK: HTTP OK: HTTP/1.1 200 OK - 455 bytes in 0.003 second response time [18:16:20] RECOVERY - Varnish HTTP upload-frontend - port 3127 on cp2014 is OK: HTTP OK: HTTP/1.1 200 OK - 455 bytes in 0.002 second response time [18:16:30] RECOVERY - Varnish HTTP upload-frontend - port 3123 on cp2014 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 0.003 second response time [18:16:40] RECOVERY - Varnish HTTP upload-frontend - port 3124 on cp2014 is OK: HTTP OK: HTTP/1.1 200 OK - 455 bytes in 0.001 second response time [18:17:25] DatGuy: update is live on mwdebug1002 [18:17:29] check please [18:17:32] checking [18:20:04] looks good [18:20:35] DatGuy: ok, going live [18:23:01] !log thcipriani@naos Synchronized dblists/commonsuploads.dblist: SWAT: [[gerrit:350439|Enable local uploads on knwiki]] T133137 (duration: 01m 06s) [18:23:07] ^ DatGuy live everywhere [18:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:10] T133137: Local upload on Kannada Wikipedia - https://phabricator.wikimedia.org/T133137 [18:23:17] cheers [18:23:50] (03PS2) 10Jdlrobson: Workaround issue of overriding whitelist config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350459 (https://phabricator.wikimedia.org/T163114) [18:25:29] jynus: Do you think https://phabricator.wikimedia.org/T163801 is a resurfacing of https://phabricator.wikimedia.org/T162121? There is a fix for that rolling out this week (group0/1 today, wikipedias tomorrow). 
[18:26:39] I do not think it is resurfacing [18:26:49] it is the same origin [18:26:52] !log thcipriani@naos Synchronized php-1.29.0-wmf.21/extensions/SecurePoll: SWAT: [[gerrit:350443|Add voter scripts for board/fdc election 2017]] T163854 (duration: 01m 00s) [18:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:02] like probably a bot doing a mass recategory [18:27:02] T163854: Create voter lists for Board & FDC Elections 2017 - https://phabricator.wikimedia.org/T163854 [18:27:03] ^ Jamesofur there's wmf.21 sync'd [18:27:12] * thcipriani does wmf.20 [18:27:16] thanks! [18:27:33] Krinkle, but I am not sure we can do something about it without very deep changes [18:27:37] *not [18:28:16] job queue or database structure changes [18:28:44] (03PS3) 10Jdlrobson: Workaround issue of overriding whitelist config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350459 (https://phabricator.wikimedia.org/T163114) [18:28:45] or counting in a cache, persist every X times/seconds [18:30:05] jynus: It's mostly to lazy re-initialize a known stale value. It previously just "did it", causing conflicts when the same query happened from multiple requests. That was meant to be fixed by adding a lock so that only one of the concurrent requests will make the query. 
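[editor's note] The fix Krinkle describes above (lazily re-initialize a known stale value, with a lock so that only one of the concurrent requests makes the query) can be sketched in-process as follows. This is only an analogy under stated assumptions: the class name LazyCount and the recompute callable are hypothetical, a thread lock stands in for whatever database-level lock the real code uses, and none of this is MediaWiki's actual implementation.

```python
import threading


class LazyCount:
    """Hypothetical holder for a cached count that may be known stale."""

    def __init__(self, recompute):
        self._recompute = recompute      # expensive query, assumed side-effect free
        self._lock = threading.Lock()
        self._value = None               # None means "known stale"

    def get(self):
        value = self._value
        if value is not None:
            return value
        # Non-blocking acquire: requests that lose the race do not queue up
        # behind the winner (queueing is what leads to lock timeouts).
        if self._lock.acquire(blocking=False):
            try:
                if self._value is None:  # re-check after winning the lock
                    self._value = self._recompute()
                return self._value
            finally:
                self._lock.release()
        # Another request is already refreshing; the caller falls back
        # (for example, shows no count) instead of waiting or erroring.
        return None
```

The design choice here is to fail fast rather than wait: a request that cannot take the lock returns a fallback immediately, which is one way to avoid the lock timeouts mentioned in the discussion.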
[18:30:08] Krinkle, the only quick solution I can think of is to reject high-rate edits [18:30:15] But it seems we now get lock timeouts [18:30:21] Which shouldn't result in an error like that [18:30:41] !log thcipriani@naos Synchronized php-1.29.0-wmf.20/extensions/SecurePoll: SWAT: [[gerrit:350444|Add voter scripts for board/fdc election 2017]] T163854 (duration: 00m 57s) [18:30:49] ^ Jamesofur and that's wmf.20 [18:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:12] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350452 (https://phabricator.wikimedia.org/T163223) (owner: 10EBernhardson) [18:31:17] (03PS4) 10Jdlrobson: Workaround issue of overriding whitelist config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350459 (https://phabricator.wikimedia.org/T163114) [18:31:49] i don't know, add the info you just said to the ticket [18:32:01] and I will think of solutions [18:32:10] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 56 ESP OK [18:32:11] but knowing it is not codfw-failover related [18:32:13] (03PS5) 10Jdlrobson: Workaround issue of overriding whitelist config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350459 (https://phabricator.wikimedia.org/T163114) [18:32:16] it can wait [18:32:25] thcipriani: perfect thanks much [18:32:30] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 56 ESP OK [18:32:41] (03Merged) 10jenkins-bot: Configure multimedia search template boosting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350452 (https://phabricator.wikimedia.org/T163223) (owner: 10EBernhardson) [18:32:50] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 56 ESP OK [18:32:53] (03CR) 10jenkins-bot: Configure multimedia search template boosting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350452 (https://phabricator.wikimedia.org/T163223) (owner: 10EBernhardson) [18:33:00] (03PS6) 10Jdlrobson: Workaround issue of overriding 
whitelist config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350459 (https://phabricator.wikimedia.org/T163114) [18:33:06] thcipriani is there space in the deploy window for one more.. ^? [18:33:16] I'm scared of Russians and Germans shouting at me :) [18:33:42] jdlrobson: sure, add to wikitech page please :) [18:33:58] ebernhardson: your config change is live on mwdebug1002, check please [18:34:01] Krinkle, or maybe we could fail silently the counts [18:34:05] (03CR) 10Tjones: [C: 04-1] "This is probably ready, but needs to wait for Elastic 5.3.1 to be deployed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350312 (https://phabricator.wikimedia.org/T163829) (owner: 10Tjones) [18:34:05] thcipriani: looking [18:34:54] oh hi, why I am receiving this message so often: "APIError: readonly: The database has been automatically locked while the slave database servers catch up to the master [readonlyreason:Waiting for 5 lagged database(s); help:See https://es.wikipedia.org/w/api.php for API usage [...]" [18:35:01] right now I wonder if the original behaviour wasn't better than the new one :-) [18:35:11] thcipriani: on it! [18:35:23] TabbyCat: Because something is lagging? :P [18:35:26] and we just reduce innodb_lock_wait timeout to a few seconds [18:35:49] I'm running my bot and it's aborted several times due to that [18:35:58] it's annoying [18:36:03] any solution? [18:36:04] thcipriani: looks reasonable [18:36:15] ebernhardson: ok, going live. [18:37:20] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3215135 (10ayounsi) This was due to T133387. Hosts that have an igmp-snooping membership don't receive IPv6 RA and thus don't have a default rout... [18:37:51] jynus: Well, it seems a logic error in the code, not database. We don't need 10 concurrent requests to update the row to the same value (-1), just one suffices. 
But it seems the locking isn't working right. [18:37:59] thcipriani: added [18:38:00] I'll add some log sto the task and ask Aaron [18:38:05] jdlrobson: thanks [18:38:26] !log thcipriani@naos Synchronized wmf-config/CirrusSearch-common.php: SWAT: [[gerrit:350452|Configure multimedia search template boosting]] T163223 (duration: 00m 53s) [18:38:28] TabbyCat, https://www.mediawiki.org/wiki/Manual:Maxlag_parameter [18:38:30] 06Operations, 06Labs: During labservices1001 failover fqdn changed from foo.project.eqiad.wmflabs to foo.eqiad.wmflabs - https://phabricator.wikimedia.org/T163823#3215139 (10chasemp) @Andrew looking at this from another angle let's say we don't know the conditions that caused all FQDN to suddenly omit project,... [18:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:37] T163223: Add manual configuration for OtherIndex template boosting - https://phabricator.wikimedia.org/T163223 [18:38:39] ^ ebernhardson first config change is now live [18:38:54] jynus: sí, I was thinking on that. maxlag 3 would be okay? 
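[editor's note] A minimal sketch of the maxlag pattern jynus links to, for a bot like the one TabbyCat describes: send maxlag with every request, and when the API answers with the "maxlag" error code, wait and retry instead of aborting. Assumptions: call_with_maxlag and the send callable are hypothetical names; send stands in for the bot's real HTTP layer and is assumed to return a (retry_after_seconds_or_None, parsed_json) pair. Frameworks such as pywikibot handle this themselves.

```python
import time


def call_with_maxlag(send, params, maxlag=3, max_retries=5,
                     default_wait=5.0, sleep=time.sleep):
    """Call the API via the hypothetical `send`, backing off while replicas lag."""
    request = dict(params, maxlag=maxlag)   # add maxlag without mutating params
    for attempt in range(max_retries + 1):
        retry_after, body = send(request)
        error = body.get("error", {})
        if error.get("code") != "maxlag":
            return body                     # success, or an unrelated error
        if attempt == max_retries:
            break
        # Honor the server's Retry-After hint when present; this waiting
        # step is the part that actually helps the lagged replicas.
        sleep(retry_after if retry_after is not None else default_wait)
    raise RuntimeError(
        "replication lag stayed above maxlag=%s after %d retries"
        % (maxlag, max_retries)
    )
```

Usage would look like call_with_maxlag(my_send, {"action": "edit", ...}, maxlag=3), where my_send wraps the bot's own request code.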
[18:39:09] it says lower values are nicer [18:39:13] thcipriani: still looks good, thanks [18:39:23] (03PS2) 10Thcipriani: Adjust sistersearch against wikivoyage to require title matching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350456 (https://phabricator.wikimedia.org/T163547) (owner: 10EBernhardson) [18:39:30] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350456 (https://phabricator.wikimedia.org/T163547) (owner: 10EBernhardson) [18:39:57] TabbyCat, the important part is the waiting part [18:40:26] (03Merged) 10jenkins-bot: Adjust sistersearch against wikivoyage to require title matching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350456 (https://phabricator.wikimedia.org/T163547) (owner: 10EBernhardson) [18:40:37] (03CR) 10jenkins-bot: Adjust sistersearch against wikivoyage to require title matching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350456 (https://phabricator.wikimedia.org/T163547) (owner: 10EBernhardson) [18:40:44] Krinkle, but I would assume they are different requests, right? [18:40:55] each reducing the count by 1 [18:41:08] thcipriani: this one is configuration that isn't referenced yet, the patch that uses it wasn't quite ready but will probably go out tomorrow [18:41:13] but they come from different edits- I do not know, I am guessing here [18:41:27] ebernhardson: kk, so nothing to check sounds like :) [18:41:47] thcipriani: syntax errors fataling mediawiki-config should be about the only possibility, but i checked that :) [18:41:58] okie doke, going live :) [18:42:11] Krinkle, so my guess is that the locking is working right [18:42:41] there is just many edits on the same category at the same time [18:42:51] jynus: Hm.. perhaps. 
If it's -1 then those can't be combined with blind locking [18:43:00] there are ways [18:43:06] but not easy [18:43:10] But the re-compute count and the clear count queries can be combined [18:43:15] Right, yeah, everything is possbile [18:43:18] Log-based persistance [18:43:21] This is prolly a different query [18:43:22] brb [18:44:10] I think contentions is ok to happen [18:44:13] !log thcipriani@naos Synchronized wmf-config/CirrusSearch-common.php: SWAT: [[gerrit:350456|Adjust sistersearch against wikivoyage to require title matching]] T163547 (duration: 01m 11s) [18:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:21] T163547: Add a way to customize the query sent to a particular wiki when running crosswiki searches - https://phabricator.wikimedia.org/T163547 [18:44:24] ^ ebernhardson 2nd config change live [18:44:43] I think what we cannot have is lag- hence a fast solution of failing early by lowering the innodb_wait_time [18:45:17] ebernhardson: also cirrussearch change for wmf.21 is live on mwdebug1002, check please [18:46:01] thcipriani: checking [18:46:20] RECOVERY - Hadoop NodeManager on analytics1038 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [18:46:50] RECOVERY - puppet last run on analytics1038 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [18:47:40] RECOVERY - Hadoop NodeManager on analytics1039 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [18:47:47] thcipriani: looks reasonable [18:47:50] RECOVERY - puppet last run on analytics1039 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [18:48:44] ebernhardson: ok, going live [18:49:10] RECOVERY - Hadoop NodeManager on analytics1040 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [18:49:50] RECOVERY - puppet last 
run on analytics1040 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [18:50:57] !log thcipriani@naos Synchronized php-1.29.0-wmf.21/extensions/CirrusSearch: SWAT: [[gerrit:350453|Provide a way to blacklist a set of wikis for crosswiki search]] T163546 (duration: 01m 02s) [18:51:02] ^ ebernhardson live everywhere [18:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:04] T163546: Add the possibility to blacklist a particular set of wikis from the SiteMatrixInterwikiResolver - https://phabricator.wikimedia.org/T163546 [18:51:10] RECOVERY - Hadoop NodeManager on analytics1041 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [18:51:16] thcipriani: thanks! i'll keep an eye on things [18:51:20] RECOVERY - puppet last run on analytics1041 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [18:51:23] ebernhardson: sounds good [18:51:48] jdlrobson: looks like you posted in evening swat :) [18:51:57] thcipriani: ahhh [18:51:58] i can move it [18:52:06] wikitext editing :( [18:52:11] (03PS7) 10Thcipriani: Workaround issue of overriding whitelist config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350459 (https://phabricator.wikimedia.org/T163114) (owner: 10Jdlrobson) [18:53:16] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350459 (https://phabricator.wikimedia.org/T163114) (owner: 10Jdlrobson) [18:53:39] jdlrobson: heh, and this is why I asked you to post the patch on wikitech :P [18:53:49] :) well played sir [18:53:50] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [18:54:32] (03Merged) 10jenkins-bot: Workaround issue of overriding whitelist config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350459 (https://phabricator.wikimedia.org/T163114) (owner: 10Jdlrobson) 
[18:55:02] jdlrobson: live on mwdebug1002, check please [18:55:11] thcipriani: on it.. [18:56:11] OMG IT WORKS [18:56:12] YES! [18:56:14] thanks Reedy [18:56:18] and thanks thcipriani for fitting it in [18:56:21] !log downgrading varnish back to 4.1.5-wm1 on all -wm2 hosts [18:56:25] sync away [18:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:35] jdlrobson: okie doke, going live :) [18:58:35] !log thcipriani@naos Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:350459|Workaround issue of overriding whitelist config variable]] T163114 (duration: 00m 53s) [18:58:39] Hi, can you also deploy https://gerrit.wikimedia.org/r/#/c/350442/ or ping me when you're done if you don't have time? [18:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:45] T163114: Regression: Fix config to disable related pages where it's not wanted - https://phabricator.wikimedia.org/T163114 [18:59:00] ^ jdlrobson live everywhere [18:59:33] hurray! [19:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170426T1900). [19:00:24] (03CR) 10jenkins-bot: Workaround issue of overriding whitelist config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350459 (https://phabricator.wikimedia.org/T163114) (owner: 10Jdlrobson) [19:00:32] twentyafterfour: you mind if I get out 1 last patch? 
[19:02:32] Dereckson: I think we have a few to get that one in [19:02:51] (03PS2) 10Thcipriani: Disable collectionsaveascommunitypage right on es.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350442 (https://phabricator.wikimedia.org/T163767) (owner: 10Dereckson) [19:03:00] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350442 (https://phabricator.wikimedia.org/T163767) (owner: 10Dereckson) [19:03:37] !log restaring varnish-frontend on cp2014 to downgrade [19:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:55] Okay, I'm adding it to the table [19:04:15] (03Merged) 10jenkins-bot: Disable collectionsaveascommunitypage right on es.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350442 (https://phabricator.wikimedia.org/T163767) (owner: 10Dereckson) [19:04:19] jynus: with maxlag:3 it's working fine, and pywikibot waits 8/9 secs. by default between edits [19:04:32] now I'll depart and the job will fail again [19:05:04] !log restarting varnish frontend and backend on cp3033 to downgrade [19:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:56] Dereckson: I just noticed this changes wmgUseContentTranslation, is that correct? [19:06:11] (03CR) 10jenkins-bot: Disable collectionsaveascommunitypage right on es.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350442 (https://phabricator.wikimedia.org/T163767) (owner: 10Dereckson) [19:06:57] No it's not [19:07:15] Fixing [19:07:18] thanks [19:08:10] PROBLEM - Varnish HTTP text-backend - port 3128 on cp3033 is CRITICAL: connect to address 10.20.0.168 and port 3128: Connection refused [19:10:22] (03PS1) 10Dereckson: Reenable wmgUseContentTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350466 [19:10:50] thcipriani: sorry I was distracted in another channel [19:10:55] all clear now? 
[19:10:56] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350466 (owner: 10Dereckson) [19:11:01] ah I guess not [19:11:04] * twentyafterfour isn't in any hurry [19:11:05] twentyafterfour: :) [19:11:22] twentyafterfour: I'll ping you when I'm done, should just take a few minutes, sorry for the delay [19:12:25] (03Merged) 10jenkins-bot: Reenable wmgUseContentTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350466 (owner: 10Dereckson) [19:12:28] not a problem [19:12:34] (03CR) 10jenkins-bot: Reenable wmgUseContentTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350466 (owner: 10Dereckson) [19:13:14] Dereckson: changes should be live on mwdebug1002, check please if possible [19:13:39] Hi jdlrobson, we've twice had reports on French wiktionary captcha doesn't appear in mobile version, would you know if there is already a bug report opened for that and do you need we ask something to these users to help to debug the issue? [19:13:43] thcipriani: testing [19:14:28] jdlrobson: feel free to answer me in private, so we let the channel all clear for twentyafterfour train [19:14:40] thcipriani: works [19:14:50] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[hadoop-hdfs-zkfc-init] [19:14:54] Dereckson: ok, going live [19:17:05] !log thcipriani@naos Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:350442|Disable collectionsaveascommunitypage right on es.wikipedia]] T163767 (duration: 00m 49s) [19:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:14] T163767: Disable creating books in Wikipedia namespace in eswiki - https://phabricator.wikimedia.org/T163767 [19:17:17] ^ Dereckson live now [19:17:20] twentyafterfour: all clear [19:17:57] I don't know if its appropriate to ask here. 
I'm interested in contributing to Wikipedia.I'm a computer science student and my skills include Python,SQL,Jenkins,Puppet and linux.Please point me in a direction where i can learn and be a successful contributor.thanks :) [19:18:10] RECOVERY - Varnish HTTP text-backend - port 3128 on cp3033 is OK: HTTP OK: HTTP/1.1 200 OK - 178 bytes in 0.241 second response time [19:18:35] thcipriani: thanks for the deploy [19:18:49] Dereckson: np :) [19:18:54] (and to have noticed the revert commited) [19:19:12] * thcipriani doffs hat [19:20:39] mk__: https://www.mediawiki.org/wiki/How_to_become_a_MediaWiki_hacker is a good place to start [19:22:41] mk__: this is not the best channel for newcomers though, this is where we do production deployments and monitor the wikimedia server clusters [19:22:53] !log initiating cumin-based restart of all varnish backends for cache_upload in codfw to downgrade from experimental package. 30 minute spacing, 10 hosts, ~5h to completion... [19:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:20] RECOVERY - Host kafka1020 is UP: PING WARNING - Packet loss = 73%, RTA = 36.08 ms [19:23:30] RECOVERY - Host analytics1045 is UP: PING OK - Packet loss = 0%, RTA = 37.26 ms [19:23:40] RECOVERY - Host analytics1067 is UP: PING OK - Packet loss = 0%, RTA = 36.15 ms [19:23:40] RECOVERY - Host analytics1035 is UP: PING OK - Packet loss = 0%, RTA = 36.47 ms [19:23:40] RECOVERY - Host analytics1037 is UP: PING OK - Packet loss = 0%, RTA = 36.56 ms [19:23:40] RECOVERY - Host kafka1018 is UP: PING OK - Packet loss = 0%, RTA = 36.49 ms [19:23:40] RECOVERY - Host analytics1044 is UP: PING OK - Packet loss = 0%, RTA = 36.62 ms [19:23:41] RECOVERY - Host analytics1043 is UP: PING OK - Packet loss = 0%, RTA = 37.23 ms [19:23:41] RECOVERY - Host analytics1068 is UP: PING OK - Packet loss = 0%, RTA = 36.22 ms [19:23:42] RECOVERY - Host analytics1002 is UP: PING OK - Packet loss = 0%, RTA = 36.20 ms [19:23:44] RECOVERY - Host 
analytics1042 is UP: PING OK - Packet loss = 0%, RTA = 36.18 ms [19:23:44] RECOVERY - Host analytics1036 is UP: PING OK - Packet loss = 0%, RTA = 36.88 ms [19:23:52] hi analytics :) [19:24:39] twentyafterfour Thanks for your help :) [19:24:51] !log begin deployment train: group1 wikis to 1.29.0-wmf.21 refs T161733 [19:24:58] mk__: no problem! [19:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:00] T161733: MW-1.29.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T161733 [19:25:47] mk__: there is an overview page on contributing at https://www.mediawiki.org/wiki/How_to_become_a_MediaWiki_hacker The next question though is what in particular you are interested in. Also #wikimedia-dev is probably a better room than #wikimedia-operations, but it's a very noisy room. If there is some more specific area you are interested in you might find a room in https://meta.wikimedia.org/wiki/IRC/Channels#MediaWiki_and_tec [19:26:50] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - failed to fetch timestamp from wikitech-static [19:27:10] (PS1) 20after4: group1 wikis to 1.29.0-wmf.21 [mediawiki-config] - https://gerrit.wikimedia.org/r/350469 [19:27:12] (CR) 20after4: [C: 2] group1 wikis to 1.29.0-wmf.21 [mediawiki-config] - https://gerrit.wikimedia.org/r/350469 (owner: 20after4) [19:27:20] PROBLEM - are wikitech and wt-static in sync on labtestweb2001 is CRITICAL: wikitech-static CRIT - failed to fetch timestamp from wikitech-static [19:28:25] Operations, ops-eqiad, netops, Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3215305 (ayounsi) Another issues was: > ## inactive: interfaces interface-range vlan-analytics1-d-eqiad activating the range solved the issue a...
[19:28:31] (Merged) jenkins-bot: group1 wikis to 1.29.0-wmf.21 [mediawiki-config] - https://gerrit.wikimedia.org/r/350469 (owner: 20after4) [19:28:40] (CR) jenkins-bot: group1 wikis to 1.29.0-wmf.21 [mediawiki-config] - https://gerrit.wikimedia.org/r/350469 (owner: 20after4) [19:28:59] ebernhar|lunch & mk__: #wikimedia-tech might be a good alternative, less noisy irc channel to check out [19:31:18] !log twentyafterfour@naos rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.29.0-wmf.21 [19:31:18] syncing... [19:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:00] uhoh [19:32:02] Warning: onEnhancedChangesListModifyBlockLineData() expects exactly 4 parameters, 3 given in /srv/mediawiki/php-1.29.0-wmf.21/extensions/Flow/Hooks.php on line 538 [19:33:55] That sounds like 2017-04-26 20:10:46 [15:33:24] Warning: onEnhancedChangesListModifyBlockLineData() expects exactly 4 parameters, 3 given in /srv/mediawiki/php-1.29.0-wmf.21/extensions/Flow/Hooks.php on line 538 [19:34:40] RECOVERY - Host cp1074 is UP: PING OK - Packet loss = 0%, RTA = 36.90 ms [19:34:40] So we already have seen this today. [19:34:50] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 54 ESP OK [19:35:10] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 70 ESP OK [19:35:20] RECOVERY - IPsec on cp4007 is OK: Strongswan OK - 54 ESP OK [19:35:30] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 56 ESP OK [19:36:19] eddiegp: hmm...
[19:37:35] twentyafterfour: Channel logs suggest it might have something to do with T163874 [19:37:36] T163874: Deleting a page on Wikidata broke related changes with enhanced rc - https://phabricator.wikimedia.org/T163874 [19:40:51] (PS4) Smalyshev: [cirrus] Increase max field count for wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/346542 (owner: DCausse) [19:41:45] well the error is happening a lot so I'm rolling back [19:42:00] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [19:42:15] !log rolling back group1 to wmf.20 due to T163896 refs T161733 [19:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:26] T163896: Warning: onEnhancedChangesListModifyBlockLineData() expects exactly 4 parameters, 3 given in /srv/mediawiki/php-1.29.0-wmf.21/extensions/Flow/Hooks.php on line 538 - https://phabricator.wikimedia.org/T163896 [19:42:27] T161733: MW-1.29.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T161733 [19:42:57] (PS1) 20after4: group1 wikis to 1.29.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/350470 [19:42:59] (CR) 20after4: [C: 2] group1 wikis to 1.29.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/350470 (owner: 20after4) [19:44:58] (Merged) jenkins-bot: group1 wikis to 1.29.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/350470 (owner: 20after4) [19:45:34] !log twentyafterfour@naos rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.29.0-wmf.20 [19:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:42] (CR) jenkins-bot: group1 wikis to 1.29.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/350470 (owner: 20after4) [19:46:20] PROBLEM - Host cp1074 is DOWN: PING CRITICAL - Packet loss = 100% [19:46:40] RECOVERY - Host cp1074 is UP: PING OK - Packet loss = 0%, RTA
= 36.79 ms [19:46:46] paladox: hi. can you tell me the difference between these? "#if ( $uploader-username != $author-username )" and "#if ( $uploader-name != $author-name )" [19:46:59] one is username and the other is name [19:47:17] so for some users they can have a different name and username ? [19:47:28] while for other users they are the same [19:47:34] name is the one you set when you do a git upload, i.e. for example your username in gerrit would be Git but on your localhost you like to call yourself Git Hello. [19:47:39] git hello would be the name [19:47:50] RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (69885 200000s) [19:48:19] mutante i am going to try something else. /me makes a new change to the patch. [19:48:24] uhm... ok.. does this explain why for Ladsgroup the bot doesn't work while for others it has no issues? [19:48:30] RECOVERY - are wikitech and wt-static in sync on labtestweb2001 is OK: wikitech-static OK - wikitech and wikitech-static in sync (69885 200000s) [19:48:36] paladox: ok, thx [19:50:00] Operations, ops-eqiad, netops, Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3215394 (ayounsi) Because of miss-communication, the move of server uplinks from asw-d to asw2-d didn't happen today. We are rescheduling it a...
[19:50:12] !log restart kafka nodes (kafka1018 and kafka1020) after network maintenance [19:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:25] (PS4) Paladox: Gerrit: Fix bot since i forgot to rename a variable [puppet] - https://gerrit.wikimedia.org/r/350451 (https://phabricator.wikimedia.org/T161525) [19:50:29] mutante done :) [19:50:46] paladox: after looking at it again i don't see how it could cause the bot not to send ANY message, the if/then/else just changes the content of the message [19:51:01] looks at PS4 [19:51:01] Yeh me too [19:51:05] which is strange [19:51:49] can you see what exactly the $name and $username are for Ladsgroup [19:51:57] compared to yours [19:52:10] Well mine is Paladox Paladox [19:52:13] is there whitespace? [19:52:30] mutante what's strange is when i tried to do it on one of ladsgroup's changes, it wouldn't work for me [19:52:33] either [19:53:11] ooh.. so as if it's not related to the user but to the specific change?? [19:53:28] is it the repo he is using ? [19:53:33] that is different [19:54:18] Hmm, not sure. I tried to remove Bug: then re-add it to see if it would work for me but it didn't [19:54:30] I don't think it's the repo as uploading a test change worked for me [19:54:53] by test change i mean a completely new patch [19:55:06] but the bot is using a deprecated phabricator conduit call so i'm not sure if it's that [19:55:37] ok, also, it DOES send a message when it gets merged [19:55:45] just not when it's uploaded.. what is the diff there [19:55:55] Oh, haven't tried merging.
[19:56:05] see example https://phabricator.wikimedia.org/T151194#3134489 [19:56:26] that's a notification about merge on the same ticket where the upload did not show up [19:56:43] (CR) Smalyshev: [C: 1] [cirrus] Increase max field count for wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/346542 (owner: DCausse) [19:57:06] oh [19:57:52] @ebernhar @twentyafterfour Thanks for your reply :) [19:57:57] it's not about "Bug: T1234" vs. "Bug:T1234" riiight? [19:57:58] T1234: Restrict Bugzilla access to read-only - https://phabricator.wikimedia.org/T1234 [19:57:59] mutante merged changes uses $submitter-name [19:58:30] Not sure [19:59:06] I've seen changes do Bug:T1 for example (not actually saw T1 but using as an example) without a space but unsure if phabricator picked the change up [19:59:07] T1: Get puppet runs into logstash - https://phabricator.wikimedia.org/T1 [20:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170426T2000). Please do the needful. [20:00:15] Ready with an ORES deploy [20:00:59] Got a remote host id warning for deployment.eqiad.wmnet [20:01:01] Looking into that [20:01:40] paladox: that $submitter-name part would explain the difference between upload and merge, yea.. so back to the variable names there. we can merge your change as it is.. or we could just remove the entire "if" and always output ALL of it "name" "username", "submittername" whatever.. at least for debug [20:01:41] Oh. weird. I'm getting redirected to "naos.codfw.wmnet" [20:02:17] yes, that's the deployment host [20:02:20] halfak: that's the current deployment server [20:02:26] where were you expecting? [20:02:30] since we are in codfw but mira is broken [20:02:46] apergos, was expecting to get an eqiad machine when using an eqiad URL [20:02:54] ah.
nope, one deployment host to rule them all [20:02:59] :) [20:02:59] mutante i guess so, though i was thinking that users may find by Test; owner: Test Test may find it confusing [20:03:01] OK [20:03:11] i have been thinking this before as well.. the .eqiad part in it [20:03:21] maybe it should just be deployment.wmnet.. dunno [20:03:26] But the bot not working for someone is a higher priority than it being a little confusing. So i will remove the if part [20:03:47] (PS5) Paladox: Gerrit: Fix bot since [puppet] - https://gerrit.wikimedia.org/r/350451 (https://phabricator.wikimedia.org/T161525) [20:03:53] someday we'll have multiple dcs live for various services but that's a good ways away [20:04:10] fun story. it looks like all of my agent forwarding is broken. [20:04:17] deployment..wmnet [20:04:20] Guess I'm fixing that now too :) [20:04:22] heh [20:05:03] agent forwarding would be disallowed by prod sshd config [20:05:32] soy templates will hopefully fix this :), though that's not implemented yet :) [20:05:43] by that i mean its-base still uses velocity [20:06:53] paladox: ok! looks like you lost part of the commit message [20:06:53] mutante i think hashar saw some kind of error though i can't remember the error or if it was with its-phabricator. [20:07:13] mutante yep, as the "forgot to rename variable" part no longer needs to be said as we removed the if part :) [20:07:36] halfak: not sure if you meant that but production SSHDs will not allow agent forwarding [20:08:29] mutante, oh yeah. I just remembered that I'd make a specific ssh key just for deployments.
[20:08:49] (PS6) Paladox: Gerrit: Fix bot by removing if part [puppet] - https://gerrit.wikimedia.org/r/350451 (https://phabricator.wikimedia.org/T161525) [20:08:56] mutante done, i've updated commit msg [20:09:18] halfak: ah, that is loaded by the "keyholder" then [20:09:30] Operations, ops-eqiad, Phabricator, Release-Engineering-Team: setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3215445 (RobH) [20:09:42] it's running on the deployment server and it should be "armed" (loaded with the deployment keys) [20:09:44] Operations, Phabricator, Release-Engineering-Team: reinstall iridium (phabricator) as phab1001 with jessie - https://phabricator.wikimedia.org/T152129#3215465 (RobH) [20:09:45] Operations, ops-eqiad, Phabricator, Release-Engineering-Team, hardware-requests: replacement hardware for iridium (phabricator) - https://phabricator.wikimedia.org/T156970#2991469 (RobH) Open>Resolved Resolving this task, as there is now a setup task for the system. [20:10:24] [naos:~] $ sudo keyholder status [20:10:30] keyholder-agent: active [20:11:29] it has several keys loaded, one for each deployment target, https://wikitech.wikimedia.org/wiki/Keyholder [20:11:40] mutante, how do I put my keys in keyholder? [20:11:42] oh docs :) [20:12:39] !log halfak@naos Started deploy [ores/deploy@cc12103]: T162892 [20:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:48] T162892: Deploy ORES mid-April - https://phabricator.wikimedia.org/T162892 [20:13:02] It's still "mid-April" right? [20:13:02] ;) [20:13:19] to add a completely new identity to that.. not 100% sure yet, but it will need changes in private repo [20:13:33] i mean i know the keys are there.. but there will be other places it needs as well [20:13:50] heh @ mid :) [20:14:24] mutante, looks like this is harder than just making a custom key just for deployments and using that.
[20:14:24] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[hadoop-hdfs-zkfc-init] [20:14:30] wah?! [20:14:35] sigh [20:16:05] halfak: yea, you need ops/releng to add a new "deployment identity" for ORES afaict. new key in private repo, then some config changes [20:16:26] halfak: quick phab ticket should make it happen though [20:16:49] canary looks good. Continuing deploy. [20:17:08] mutante, will the deploy key be shared among deployers on the team or should we all have a separate one? [20:17:16] we all == me and Amir1 [20:17:49] (PS1) RobH: phab1001 dns entries [dns] - https://gerrit.wikimedia.org/r/350476 [20:17:51] halfak: it would be one per team, called "oresdeploy" (like existing analytics-deploy, mwdeploy,..) [20:18:15] like the table on https://wikitech.wikimedia.org/wiki/Keyholder [20:18:30] in the "identity" column [20:18:38] Ahh yeah. I had no idea what was in that table :) [20:19:15] mutante, how would a task for this be tagged? [20:19:21] I'm hoping to find one to use as a template.
[20:19:24] (PS2) Krinkle: Enable $wgEnableWANCacheReaper for testwiki and mw.org [mediawiki-config] - https://gerrit.wikimedia.org/r/339245 (owner: Aaron Schulz) [20:19:33] (PS3) Krinkle: Enable $wgEnableWANCacheReaper for testwiki and test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/339245 (owner: Aaron Schulz) [20:19:37] (CR) Krinkle: [C: -1] Enable $wgEnableWANCacheReaper for testwiki and test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/339245 (owner: Aaron Schulz) [20:20:03] !log bsitzmann@naos Started deploy [mobileapps/deploy@b5afcb8]: Update mobileapps to 14bd4a5 [20:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:08] halfak: add tag "Operations" and i think "Deployment-Systems" would make sense [20:22:25] Operations, Ops-Access-Requests, Deployment-Systems: Enable keyholder for ORES deployments - https://phabricator.wikimedia.org/T163939#3215512 (Halfak) [20:22:26] https://phabricator.wikimedia.org/T163939 [20:22:45] also see https://phabricator.wikimedia.org/T133211 (just found it myself first time) [20:23:47] (CR) RobH: [C: 2] phab1001 dns entries [dns] - https://gerrit.wikimedia.org/r/350476 (owner: RobH) [20:24:28] Operations, ops-eqiad, Phabricator, Release-Engineering-Team, Patch-For-Review: setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3215530 (RobH) [20:25:05] Operations, Ops-Access-Requests, Deployment-Systems: Enable keyholder for ORES deployments - https://phabricator.wikimedia.org/T163939#3215512 (Dzahn) I think what needs to happen here is adding a new "identity" for ORES deployments. As in the Identity column in the table on https://wikitech.wikimed... [20:25:11] halfak: alright, i commented^ [20:26:04] mutante, still no idea what I need to do here :) [20:26:36] what's pwstore?
[20:27:01] halfak: pwstore is an encrypted repo that Ops uses to share passwords [20:27:07] it never works right [20:27:10] (for me at least :) ) [20:27:15] Operations, ops-eqiad, Phabricator, Release-Engineering-Team, Patch-For-Review: setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3215563 (RobH) [20:27:16] lol [20:27:56] this sounds great. Maybe I should just keep generating keys that only allow a read only ssh to my repos for deployment and just dump them on the deployment hosts. That seems easier at this point. [20:28:17] halfak: i am talking to other ops people when i commented there about pwstore [20:28:29] gotcha. Cool :) [20:28:48] didn't expect you to do anything so far, i think you just need a root to do it all for you [20:28:55] and/or releng [20:29:23] (PS1) Catrope: Enable Flow beta feature on arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/350478 (https://phabricator.wikimedia.org/T155720) [20:29:51] RECOVERY - Zookeeper Server on conf1003 is OK: PROCS OK: 1 process with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg [20:29:56] Operations, Labs, wikitech.wikimedia.org: Update wikitech-static and develop procedures to keep it maintained - https://phabricator.wikimedia.org/T163721#3215576 (Andrew) I've updated https://wikitech.wikimedia.org/wiki/Wikitech-static with some vague maintenance instructions. The root password now...
[20:29:56] cool :) [20:30:53] so it seems there was once a proposal to automate this procedure but it was declined at the end (for now) [20:30:58] that's https://phabricator.wikimedia.org/T133211 [20:31:03] !log restart zookeeper on conf1003 after network maintenance [20:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:35] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [20:33:56] * halfak does the promote_and_restart dance [20:34:08] !log halfak@naos Finished deploy [ores/deploy@cc12103]: T162892 (duration: 21m 28s) [20:34:14] \o/ [20:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:15] T162892: Deploy ORES mid-April - https://phabricator.wikimedia.org/T162892 [20:34:38] All looks good. Thanks for your help mutante & andrewbogott [20:34:54] :) [20:34:59] * halfak gets back to writing papers. [20:35:21] !log bsitzmann@naos Finished deploy [mobileapps/deploy@b5afcb8]: Update mobileapps to 14bd4a5 (duration: 15m 17s) [20:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:14] twentyafterfour: https://gerrit.wikimedia.org/r/#/c/350477/ fixes T163896 and is merged, do you want me to cherry-pick it so you can retry group1? [20:38:15] T163896: Warning: onEnhancedChangesListModifyBlockLineData() expects exactly 4 parameters, 3 given in /srv/mediawiki/php-1.29.0-wmf.21/extensions/Flow/Hooks.php on line 538 - https://phabricator.wikimedia.org/T163896 [20:38:35] RECOVERY - puppet last run on conf1003 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [20:38:36] RoanKattouw: sure or I can do it [20:38:41] OK [20:38:45] I'm in a meeting, so can you do it? 
[20:38:53] sure no prob, thanks for the heads up [20:39:47] (PS1) Andrew Bogott: Revert "openstack: Temporarily disable self service instance creation and deletion on horizon" [puppet] - https://gerrit.wikimedia.org/r/350480 [20:41:09] (CR) Dzahn: [C: 2] Gerrit: Fix bot by removing if part [puppet] - https://gerrit.wikimedia.org/r/350451 (https://phabricator.wikimedia.org/T161525) (owner: Paladox) [20:41:38] Uh oh ORES is mad. Looking into it. [20:41:43] (PS2) Andrew Bogott: Revert "openstack: Temporarily disable self service instance creation and deletion on horizon" [puppet] - https://gerrit.wikimedia.org/r/350480 [20:44:02] tgr or Nemo_Bis, hi, i'm wondering, could you retry uploading your changes that link to phabricator please. (remove Bug: and then re-add it) [20:44:03] please [20:45:29] or amir1 ^^ [20:45:42] I guess ORES had a hiccup? [20:45:44] ladsgroup ^^ [20:46:09] paladox, he seems to be offline now. [20:46:22] Oh, thanks [20:46:46] arlolra is doing a parsoid deploy now. [20:47:39] (CR) Andrew Bogott: [C: 2] Revert "openstack: Temporarily disable self service instance creation and deletion on horizon" [puppet] - https://gerrit.wikimedia.org/r/350480 (owner: Andrew Bogott) [20:47:41] (PS2) Paladox: Set echoIcon for notification of wikibase in test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/350418 (owner: Ladsgroup) [20:47:47] (PS3) Paladox: Set echoIcon for notification of wikibase in test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/350418 (https://phabricator.wikimedia.org/T142102) (owner: Ladsgroup) [20:48:03] (CR) Paladox: "Im just testing the bot, sorry if this caused spam or noise."
[mediawiki-config] - https://gerrit.wikimedia.org/r/350418 (https://phabricator.wikimedia.org/T142102) (owner: Ladsgroup) [20:48:18] !log deploying https://gerrit.wikimedia.org/r/#/c/350481/1 to get the train back on track refs T161733 [20:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:27] T161733: MW-1.29.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T161733 [20:48:30] !log arlolra@naos Started deploy [parsoid/deploy@8d109eb]: Updating Parsoid to 4949857a [20:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:41] Response time on CODFW is terrible. [20:51:41] Operations: Ops pwstore currently read-only - https://phabricator.wikimedia.org/T163942#3215653 (Andrew) [20:51:50] Operations: Ops pwstore currently read-only - https://phabricator.wikimedia.org/T163942#3215665 (Andrew) p:Triage>High [20:52:22] akosiaris, any chance you're still around? [20:54:19] yeah. Definitely getting a bunch of timeout errors in codefw [20:54:22] *codfw [20:54:28] twentyafterfour: so the phab cluster support that requires that phab servers can ssh to each other... am i right that as of today it is not active?
[20:54:40] mutante: right [20:54:41] twentyafterfour: just asking because phab1001 is really coming now to replace iridium [20:54:51] so Rob is in the middle of setting that up [20:54:53] new hardware [20:55:07] and he just changed the phab1001 from CNAME for iridium to a real IP [20:55:08] yeah the clustering stuff is still disabled pending further testing [20:55:14] mutante: cool [20:55:19] which means that it influenced ferm rules [20:55:20] we had a chat in #releng [20:55:23] !log arlolra@naos Finished deploy [parsoid/deploy@8d109eb]: Updating Parsoid to 4949857a (duration: 06m 52s) [20:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:38] in hiera there is "phab1001" which is used in ferm rules to allow the ssh connect [20:55:53] so changing phab1001 affected that and we wanted to make sure it doesnt break anything [20:56:16] I can test things, the main thing that is at risk is git-ssh.wikimedia.org [20:56:18] once we kill iridium it will be correct as it is now and we just need to change "phabricator_active_server" [20:56:31] twentyafterfour: Sorry about the 'onEnhancedChangesListModifyBlockLineData' bug. I saw the ticket (and put it on my list) this morning but failed to understand the priority of it. [20:56:44] if we break git-ssh it won't be an emergency, not many people using that for their daily work [20:56:47] i looked, i'm pretty sure the only place it is used is the firewall rules for the "ssh between phab servers" [20:57:04] stephanebisson: it's ok, thanks! [20:57:23] I could have marked it unbreak-now but there wasn't any outage so I set it to high [20:57:36] mutante: ok cool then we are all good [20:57:49] :) ok, great [20:58:05] so I'm definitely having a problem with ORES in codfw, but the problem doesn't exist in eqiad. [20:58:42] I'm undeclaring victory. [20:59:59] I need some ops help to diagnose this one, I think. [21:00:04] making a task with notes. 
[21:02:19] https://phabricator.wikimedia.org/T163944 [21:05:46] Hey folks. Let's say I want to just restart the ORES services on codfw scb2* nodes. How would I do that? [21:05:49] !log Updated Parsoid to 4949857a (T116508, T64270, T133673) [21:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:59] T133673: Add width/height attributes to the