[00:00:21] (Merged) jenkins-bot: Enable ORES on fiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/350488 (https://phabricator.wikimedia.org/T163011) (owner: Catrope)
[00:00:28] (CR) jenkins-bot: Enable ORES on fiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/350488 (https://phabricator.wikimedia.org/T163011) (owner: Catrope)
[00:01:46] (PS7) Madhuvishy: sge: Revamp queue configuration puppet [puppet] - https://gerrit.wikimedia.org/r/352895
[00:02:57] (CR) jerkins-bot: [V: -1] sge: Revamp queue configuration puppet [puppet] - https://gerrit.wikimedia.org/r/352895 (owner: Madhuvishy)
[00:03:37] (PS2) Dzahn: camus: use logrotate::conf for logrotate [puppet] - https://gerrit.wikimedia.org/r/352649
[00:06:25] (CR) Dzahn: [C: 2] camus: use logrotate::conf for logrotate [puppet] - https://gerrit.wikimedia.org/r/352649 (owner: Dzahn)
[00:07:18] (CR) Dzahn: [C: 2] graphite: use logrotate::conf for logrotate [puppet] - https://gerrit.wikimedia.org/r/352639 (owner: Dzahn)
[00:07:24] (PS3) Dzahn: graphite: use logrotate::conf for logrotate [puppet] - https://gerrit.wikimedia.org/r/352639
[00:10:35] !log Running extensions/ORES/maintenance/PopulateDatabase.php on fiwiki
[00:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:43] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable ORES on fiwiki (T163011) (duration: 00m 43s)
[00:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:51] T163011: Deploy ORES Review Tool to Finnish Wikipedia - https://phabricator.wikimedia.org/T163011
[00:15:45] RECOVERY - puppet last run on dbstore1001 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[00:19:09] (CR) Dzahn: [C: 2] elasticsearch: use logrotate::conf for logrotate [puppet] - https://gerrit.wikimedia.org/r/352635 (owner: Dzahn)
[00:19:15] (PS2) Dzahn: elasticsearch: use logrotate::conf for logrotate [puppet] - https://gerrit.wikimedia.org/r/352635
[00:23:01] RoanKattouw, I want to deploy https://gerrit.wikimedia.org/r/#/c/352980/
[00:23:29] MaxSem: Go for it, I'm done
[00:23:32] thx
[00:23:39] (PS3) MaxSem: Wikitech: Remove $smwgNamespacesWithSemanticLinks config [mediawiki-config] - https://gerrit.wikimedia.org/r/352980 (https://phabricator.wikimedia.org/T53642) (owner: Paladox)
[00:23:58] (CR) MaxSem: [C: 2] Wikitech: Remove $smwgNamespacesWithSemanticLinks config [mediawiki-config] - https://gerrit.wikimedia.org/r/352980 (https://phabricator.wikimedia.org/T53642) (owner: Paladox)
[00:24:15] (CR) Dzahn: [C: 2] mediawiki::jobrunner: use logrotate::conf for logrotate [puppet] - https://gerrit.wikimedia.org/r/352638 (owner: Dzahn)
[00:24:21] (PS2) Dzahn: mediawiki::jobrunner: use logrotate::conf for logrotate [puppet] - https://gerrit.wikimedia.org/r/352638
[00:27:09] (Merged) jenkins-bot: Wikitech: Remove $smwgNamespacesWithSemanticLinks config [mediawiki-config] - https://gerrit.wikimedia.org/r/352980 (https://phabricator.wikimedia.org/T53642) (owner: Paladox)
[00:27:17] (CR) jenkins-bot: Wikitech: Remove $smwgNamespacesWithSemanticLinks config [mediawiki-config] - https://gerrit.wikimedia.org/r/352980 (https://phabricator.wikimedia.org/T53642) (owner: Paladox)
[00:29:05] !log maxsem@tin Synchronized wmf-config/wikitech.php: https://gerrit.wikimedia.org/r/#/c/352980/3 (duration: 00m 42s)
[00:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:30:32] (PS3) Dzahn: base::puppet: use logrotate::conf for logrotate [puppet] - https://gerrit.wikimedia.org/r/352640
[00:31:48] paladox, thanks for your contribution :)
[00:33:01] (CR) Dzahn: [C: 2] base::puppet: use logrotate::conf for logrotate [puppet] - https://gerrit.wikimedia.org/r/352640 (owner: Dzahn)
[00:36:10] (PS4) Dzahn: systemd: use logrotate::conf for logrotate config [puppet] - https://gerrit.wikimedia.org/r/351703
[00:36:41] (PS1) MaxSem: Remove semanticness from another place [mediawiki-config] - https://gerrit.wikimedia.org/r/352985
[00:38:00] (PS2) MaxSem: Remove semanticness from another place [mediawiki-config] - https://gerrit.wikimedia.org/r/352985 (https://phabricator.wikimedia.org/T53642)
[00:45:55] (CR) Dzahn: [C: 2] systemd: use logrotate::conf for logrotate config [puppet] - https://gerrit.wikimedia.org/r/351703 (owner: Dzahn)
[00:47:25] (PS2) Dzahn: salt: use logrotate::conf for logrotate [puppet] - https://gerrit.wikimedia.org/r/352658
[00:56:04] (PS1) TerraCodes: git.wikimedia.org -> phab [software/swift-utils] - https://gerrit.wikimedia.org/r/352987 (https://phabricator.wikimedia.org/T139089)
[01:00:04] (CR) TerraCodes: [C: 1] git.wikimedia.org -> phab [software/swift-utils] - https://gerrit.wikimedia.org/r/352987 (https://phabricator.wikimedia.org/T139089) (owner: TerraCodes)
[01:01:56] (CR) Dzahn: [C: 2] salt: use logrotate::conf for logrotate [puppet] - https://gerrit.wikimedia.org/r/352658 (owner: Dzahn)
[01:13:06] (PS2) Dzahn: profile::base: use logrotate::conf for logrotate [puppet] - https://gerrit.wikimedia.org/r/352661
[01:14:27] (CR) Dzahn: [C: 2] profile::base: use logrotate::conf for logrotate [puppet] - https://gerrit.wikimedia.org/r/352661 (owner: Dzahn)
[01:16:12] (PS2) Dzahn: dynamicproxy: use logrotate::conf for logrotate [puppet] - https://gerrit.wikimedia.org/r/352660
[01:21:06] (Abandoned) Dzahn: final decom of arsenic (mgmt and asset tag) [dns] - https://gerrit.wikimedia.org/r/351671 (https://phabricator.wikimedia.org/T83340) (owner: Dzahn)
[01:28:15] (PS12) Dzahn: deployment::server: convert to profile/role [puppet] - https://gerrit.wikimedia.org/r/344728
[01:28:34] (PS13) Dzahn: deployment::server: convert to profile/role [puppet] - https://gerrit.wikimedia.org/r/344728
[01:29:15] (CR) Dzahn: [C: 2] "rebased, compiled again http://puppet-compiler.wmflabs.org/6340/ no-op - double-confirming on naos / mira first before tin" [puppet] - https://gerrit.wikimedia.org/r/344728 (owner: Dzahn)
[01:34:25] (CR) Dzahn: "confirmed no-op on naos, mira and tin" [puppet] - https://gerrit.wikimedia.org/r/344728 (owner: Dzahn)
[01:44:16] (PS1) Dzahn: etcd: use logrotate::conf for logrotate [puppet] - https://gerrit.wikimedia.org/r/352994
[01:47:21] (PS1) Dzahn: apertium: use logrotate::conf for logrotate [puppet] - https://gerrit.wikimedia.org/r/352996
[01:49:58] (PS1) Dzahn: udp2log: use logrotate::conf for logrotate [puppet] - https://gerrit.wikimedia.org/r/352998
[01:51:32] (PS1) Dzahn: kafkatee: use logrotate::conf for logrotate [puppet] - https://gerrit.wikimedia.org/r/352999
[01:53:13] (CR) TerraCodes: [C: 1] "recheck" [software/swift-utils] - https://gerrit.wikimedia.org/r/352987 (https://phabricator.wikimedia.org/T139089) (owner: TerraCodes)
[01:53:38] (PS1) Dzahn: hadoop: use logrotate::conf for logrotate [puppet] - https://gerrit.wikimedia.org/r/353000
[01:54:02] (CR) Dzahn: [C: 1] "i think there is no jenkins-bot on this repo @TerraCodes .. lgtm btw" [software/swift-utils] - https://gerrit.wikimedia.org/r/352987 (https://phabricator.wikimedia.org/T139089) (owner: TerraCodes)
[01:55:37] (PS2) Dzahn: apertium: use logrotate::conf for logrotate [puppet] - https://gerrit.wikimedia.org/r/352996
[01:57:37] (PS2) Dzahn: udp2log: use logrotate::conf for logrotate [puppet] - https://gerrit.wikimedia.org/r/352998
[01:59:55] (PS2) Dzahn: kafkatee: use logrotate::conf for logrotate [puppet] - https://gerrit.wikimedia.org/r/352999
[02:01:45] (PS2) Dzahn: etcd: use logrotate::conf for logrotate [puppet] - https://gerrit.wikimedia.org/r/352994
[02:14:00] (CR) Chad: [V: 2 C: 2] git.wikimedia.org -> phab [software/swift-utils] - https://gerrit.wikimedia.org/r/352987 (https://phabricator.wikimedia.org/T139089) (owner: TerraCodes)
[02:30:15] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.21) (duration: 08m 09s)
[02:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:55:46] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.1) (duration: 06m 50s)
[02:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:02:23] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed May 10 03:02:23 UTC 2017 (duration 6m 37s)
[03:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:50:50] (PS8) Madhuvishy: sge: Revamp queue,rqs configuration puppet [puppet] - https://gerrit.wikimedia.org/r/352895
[03:52:28] (CR) jerkins-bot: [V: -1] sge: Revamp queue,rqs configuration puppet [puppet] - https://gerrit.wikimedia.org/r/352895 (owner: Madhuvishy)
[04:46:46] !log kartik@tin Started deploy [cxserver/deploy@533b4f4]: Update cxserver to 534619c
[04:46:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:49:24] !log kartik@tin Finished deploy [cxserver/deploy@533b4f4]: Update cxserver to 534619c (duration: 02m 38s)
[04:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:04] Amir1: Dear anthropoid, the time has come. Please deploy Cleaning ores_classification table (phab:T159753) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170510T0500).
[05:00:05] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753
[05:02:55] on it
[05:04:59] !log start of cleaning up ores_classification rows for three hours
[05:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:06:21] (PS9) Madhuvishy: sge: Revamp queue,rqs configuration puppet [puppet] - https://gerrit.wikimedia.org/r/352895
[05:07:35] (CR) jerkins-bot: [V: -1] sge: Revamp queue,rqs configuration puppet [puppet] - https://gerrit.wikimedia.org/r/352895 (owner: Madhuvishy)
[05:08:15] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0; xe-3/2/3: down - Core: cr2-codfw:xe-5/0/1 (Zayo, OGYX/120003//ZYO) 36ms {#2909} [10Gbps wave]
[05:09:15] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0
[05:09:50] (PS10) Madhuvishy: sge: Revamp queue,rqs configuration puppet [puppet] - https://gerrit.wikimedia.org/r/352895
[05:14:44] (PS11) Madhuvishy: sge: Revamp queue,rqs configuration puppet [puppet] - https://gerrit.wikimedia.org/r/352895
[05:17:20] Operations: Fix UIDs for deployment server users - https://phabricator.wikimedia.org/T163667#3250511 (MoritzMuehlenhoff) p:Triage>Normal
[05:50:47] Operations, ops-eqiad, Analytics, DC-Ops, Patch-For-Review: Decom/Reclaim analytics1027 - https://phabricator.wikimedia.org/T161597#3250586 (elukey) Thanks @Dzahn! Next time I will not put the host in role spare but I'll remove everything!
[05:53:48] (PS7) Giuseppe Lavagetto: restbase: migration to role/profile for the dev cluster [puppet] - https://gerrit.wikimedia.org/r/352851
[05:56:05] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:56:05] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:56:06] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:56:06] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:56:06] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:56:06] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:56:06] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:56:07] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:56:07] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:56:08] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:56:08] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:56:08] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:56:55] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:56:55] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave
[05:56:55] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:56:55] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[05:56:55] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[05:56:55] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:56:56] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:56:57] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:56:57] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[05:56:58] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[05:56:58] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave
[05:56:59] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[05:58:08] ^ backups probably
[05:58:21] <_joe_> yeah I assumed you were working there
[05:58:36] <_joe_> Slave_SQL_Running: No, (no error: intentional) LOL
[05:58:48] hehe
[06:02:42] (PS8) Giuseppe Lavagetto: restbase: migration to role/profile for the dev cluster [puppet] - https://gerrit.wikimedia.org/r/352851
[06:04:06] PROBLEM - salt-minion processes on ms-be2009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:04:06] PROBLEM - swift-object-replicator on ms-be2009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:04:06] PROBLEM - swift-container-updater on ms-be2009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:04:06] PROBLEM - swift-account-auditor on ms-be2009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:04:06] PROBLEM - swift-object-updater on ms-be2009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:04:06] PROBLEM - swift-account-reaper on ms-be2009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:04:55] RECOVERY - salt-minion processes on ms-be2009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[06:04:55] RECOVERY - swift-account-auditor on ms-be2009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[06:04:55] RECOVERY - swift-account-reaper on ms-be2009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[06:04:55] RECOVERY - swift-container-updater on ms-be2009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[06:04:55] RECOVERY - swift-object-replicator on ms-be2009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[06:04:56] RECOVERY - swift-object-updater on ms-be2009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[06:08:13] !log Run pt-table-checksum on s7.frwiktionary - T163190
[06:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:22] T163190: Run pt-table-checksum on s7 - https://phabricator.wikimedia.org/T163190
[06:08:49] Operations, Pybal, Traffic, netops: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3250599 (elukey) Checked as well, thanks for the pointer! tcpdump -i lo doesn't show any RST for apache.
[06:11:02] Operations, ops-eqiad, Analytics-Cluster, Analytics-Kanban, User-Elukey: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#3250605 (elukey) Checked on analytics10[32,33] and mcelog shows no events after Chris' maintenance.
[06:15:14] (CR) Giuseppe Lavagetto: [C: 1] "https://puppet-compiler.wmflabs.org/6343/restbase-dev1001.eqiad.wmnet/ this is now a noop, minus the change in the system role and the eve" [puppet] - https://gerrit.wikimedia.org/r/352851 (owner: Giuseppe Lavagetto)
[06:15:23] !log Deploy alter table wikidatawiki.wb_terms on dbstore1001 - T162539 T163190
[06:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:31] T163190: Run pt-table-checksum on s7 - https://phabricator.wikimedia.org/T163190
[06:15:31] T162539: Deploy schema change for adding term_full_entity_id column to wb_terms table - https://phabricator.wikimedia.org/T162539
[06:17:34] Operations, ops-eqiad, Analytics-Cluster, Analytics-Kanban, User-Elukey: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#3250612 (elukey) Hosts remaining to do: * analytics1060.eqiad.wmnet * analytics1029.eqiad.wmnet * analytics1037.eqiad.wmnet * an...
[06:21:23] Operations, Traffic, Patch-For-Review: prometheus-vhtcpd-stats cronspamming if vhtcpd is not running yet - https://phabricator.wikimedia.org/T157353#3250615 (elukey)
[06:24:45] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0
[06:27:45] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]
[06:36:27] !log installing rtmpdump security updates on trusty
[06:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:37] Operations, DNS, Traffic, Services (next): icinga alerts on nodejs services when a recdns server is depooled - https://phabricator.wikimedia.org/T162818#3250630 (MoritzMuehlenhoff) p:Triage>High
[06:43:17] Operations, vm-requests: codfw: VM request for poolcounter2001 - https://phabricator.wikimedia.org/T163892#3250633 (MoritzMuehlenhoff) p:Triage>Normal
[06:43:37] Operations, Wikimedia-General-or-Unknown, Patch-For-Review: Investigate why firejails break PdfHandler - https://phabricator.wikimedia.org/T164145#3250634 (MoritzMuehlenhoff) p:Triage>Normal
[07:02:35] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 853.32 seconds
[07:03:04] ^ I am fixing that
[07:03:20] PROBLEM - MariaDB Slave SQL: s7 on db1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:03:24] PROBLEM - MariaDB Slave IO: s7 on db1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:03:36] and that too
[07:03:42] (PS1) Jdlrobson: Disable page previews beta features on various projects [mediawiki-config] - https://gerrit.wikimedia.org/r/353011 (https://phabricator.wikimedia.org/T164740)
[07:04:10] RECOVERY - MariaDB Slave SQL: s7 on db1028 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[07:04:14] RECOVERY - MariaDB Slave IO: s7 on db1028 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:06:05] PROBLEM - MariaDB Slave IO: s3 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:06:05] PROBLEM - MariaDB Slave IO: x1 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:06:05] PROBLEM - MariaDB Slave SQL: s1 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:06:06] PROBLEM - MariaDB Slave SQL: s6 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:06:15] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:06:31] there seems to be lag everywhere on s7
[07:06:35] yeah
[07:06:35] PROBLEM - MariaDB Slave IO: s4 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:06:35] PROBLEM - MariaDB Slave SQL: s4 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:06:35] PROBLEM - MariaDB Slave IO: m2 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:06:35] PROBLEM - MariaDB Slave SQL: s2 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:06:42] it should be gone in eqiad now
[07:06:55] PROBLEM - MariaDB Slave SQL: s5 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:06:55] PROBLEM - MariaDB Slave IO: s7 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:06:55] PROBLEM - MariaDB Slave IO: s2 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:06:55] PROBLEM - MariaDB Slave IO: s6 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:06:55] PROBLEM - MariaDB Slave SQL: s7 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:06:55] PROBLEM - MariaDB Slave IO: s5 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:06:56] PROBLEM - MariaDB Slave IO: m3 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:06:56] PROBLEM - MariaDB Slave IO: s1 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:06:57] PROBLEM - MariaDB Slave SQL: m3 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:06:58] pt-table-checksum and all the hosts for frwiktionary having the wrong PK table
[07:07:05] PROBLEM - MariaDB Slave SQL: m2 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:07:05] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:07:52] (PS1) Jdlrobson: Add new Arabic Wikipedia logo [mediawiki-config] - https://gerrit.wikimedia.org/r/353012 (https://phabricator.wikimedia.org/T164648)
[07:08:28] should be almost gone in codfw too now
[07:08:33] all hosts have the "wrong PK tables"- only the master was converted
[07:08:41] sorry, i wasn't clear
[07:08:47] the revision table on frwiktionary
[07:08:53] ok
[07:09:01] it had the wrong PK on all the hosts, like if they were recentchanges slaves…which we always ignore
[07:09:11] lag is gone now, only pending dbstore1002
[07:09:44] lovely snowflakes!
[07:10:25] RECOVERY - MariaDB Slave SQL: s4 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[07:10:26] RECOVERY - MariaDB Slave IO: s4 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:10:26] RECOVERY - MariaDB Slave IO: m2 on dbstore1002 is OK: OK slave_io_state not a slave
[07:10:26] RECOVERY - MariaDB Slave SQL: s2 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[07:10:45] RECOVERY - MariaDB Slave SQL: s5 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[07:10:45] RECOVERY - MariaDB Slave IO: s2 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:10:45] RECOVERY - MariaDB Slave IO: s7 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:10:45] RECOVERY - MariaDB Slave SQL: s7 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[07:10:45] RECOVERY - MariaDB Slave IO: m3 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:10:46] RECOVERY - MariaDB Slave IO: s5 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:10:46] RECOVERY - MariaDB Slave IO: s6 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:10:47] RECOVERY - MariaDB Slave IO: s1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:10:47] RECOVERY - MariaDB Slave SQL: m3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[07:10:55] RECOVERY - MariaDB Slave SQL: m2 on dbstore1002 is OK: OK slave_sql_state not a slave
[07:10:55] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[07:11:05] RECOVERY - MariaDB Slave IO: x1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:11:05] RECOVERY - MariaDB Slave SQL: s6 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[07:11:05] RECOVERY - MariaDB Slave IO: s3 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:11:05] RECOVERY - MariaDB Slave SQL: s1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[07:11:05] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[07:13:19] !log another round of cleaning up ores_classification is done, 12M rows deleted. Current number of rows: 64,902,521 (T159753)
[07:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:27] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753
[07:16:04] !log Disable replication codfw > eqiad on s2 -T147166 T130067
[07:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:11] T147166: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166
[07:16:12] T130067: Add wl_id to watchlist tables on production dbs - https://phabricator.wikimedia.org/T130067
[07:30:41] !log Stop replication at the same position on db10418 and db2017 - T147166 https://phabricator.wikimedia.org/T130067
[07:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:51] T147166: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166
[07:41:07] Operations, Discovery: Recurrent Postgres replication lag for codfw maps hosts - https://phabricator.wikimedia.org/T161870#3250719 (elukey) Open>Resolved a:elukey
[07:41:16] (PS1) Faidon Liambotis: Revert "base::standard_packages: Remove ubuntu precise check" [puppet] - https://gerrit.wikimedia.org/r/353013
[07:42:35] PROBLEM - Check whether ferm is active by checking the default input chain on db1069 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[07:43:26] RECOVERY - Check whether ferm is active by checking the default input chain on db1069 is OK: OK ferm input default policy is set
[07:45:04] (PS2) Faidon Liambotis: Revert "base::standard_packages: Remove ubuntu precise check" [puppet] - https://gerrit.wikimedia.org/r/353013
[07:45:30] (CR) Faidon Liambotis: [C: 2] Revert "base::standard_packages: Remove ubuntu precise check" [puppet] - https://gerrit.wikimedia.org/r/353013 (owner: Faidon Liambotis)
[07:45:49] (PS2) Jdlrobson: Clean up inappropriate usages of wmg [mediawiki-config] - https://gerrit.wikimedia.org/r/351922 (https://phabricator.wikimedia.org/T151891)
[07:46:18] (PS1) Elukey: Fix logrotate config for analytics1003 to avoid cronspam [puppet] - https://gerrit.wikimedia.org/r/353014 (https://phabricator.wikimedia.org/T132324)
[07:47:48] (CR) Elukey: [V: 2 C: 2] Fix logrotate config for analytics1003 to avoid cronspam [puppet] - https://gerrit.wikimedia.org/r/353014 (https://phabricator.wikimedia.org/T132324) (owner: Elukey)
[07:47:54] (PS2) Elukey: Fix logrotate config for analytics1003 to avoid cronspam [puppet] - https://gerrit.wikimedia.org/r/353014 (https://phabricator.wikimedia.org/T132324)
[07:48:01] (CR) Elukey: [V: 2 C: 2] Fix logrotate config for analytics1003 to avoid cronspam [puppet] - https://gerrit.wikimedia.org/r/353014 (https://phabricator.wikimedia.org/T132324) (owner: Elukey)
[07:48:35] PROBLEM - Check whether ferm is active by checking the default input chain on db1069 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[07:49:26] RECOVERY - Check whether ferm is active by checking the default input chain on db1069 is OK: OK ferm input default policy is set
[07:51:47] Operations, ops-eqdfw, Analytics, DC-Ops: SATA errors for stat1004 in the dmesg - https://phabricator.wikimedia.org/T162770#3250730 (elukey) @Cmjohnson sorry for the late response, didn't notice your answer! So we have two sw raid10 already running, so I'd say AHCI (so not hw raid) but please l...
[07:52:15] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0
[07:55:14] I love how this says OK on IRC and CRIT on the web still
[07:55:15] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]
[07:55:29] ah, again I suppose
[08:01:00] (PS9) Giuseppe Lavagetto: restbase: migration to role/profile for the dev cluster [puppet] - https://gerrit.wikimedia.org/r/352851
[08:05:37] Operations, netops: LibreNMS improvements - https://phabricator.wikimedia.org/T164911#3250758 (ayounsi)
[08:06:29] fyi, paravoid, akosiaris, ^ , I listed everything I could think about
[08:08:18] ok, thanks
[08:09:27] addshore: good morning. Have you found out about TwoColConflict yesterday issue ?
[08:10:31] (PS1) Jcrespo: Enble puppet on db2062- a one-time experiment for mariadb 10.1 [puppet] - https://gerrit.wikimedia.org/r/353015 (https://phabricator.wikimedia.org/T116557)
[08:10:49] (PS2) Sfic: Import sources on dty.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/352873 (https://phabricator.wikimedia.org/T164573)
[08:11:15] Just walking into the office, 2 secs :)
[08:11:48] (CR) Marostegui: [C: 1] Enble puppet on db2062- a one-time experiment for mariadb 10.1 [puppet] - https://gerrit.wikimedia.org/r/353015 (https://phabricator.wikimedia.org/T116557) (owner: Jcrespo)
[08:12:39] (PS2) Jcrespo: Enble puppet on db2062- a one-time experiment for mariadb 10.1 [puppet] - https://gerrit.wikimedia.org/r/353015 (https://phabricator.wikimedia.org/T116557)
[08:13:36] (PS1) Hashar: nodepool: do not install Package[libguestfs-tools] [puppet] - https://gerrit.wikimedia.org/r/353016
[08:13:51] hashar: reedy took a look and didn't spot anything, left it with Krinkle last night and I woke up this morning and it seems fixed. But not sure if Krinkle actually fixed anything or not
[08:14:17] (CR) Hashar: "libguestfs-tools is not needed on labnodepool1001. I have added it as a convenience but that is better done on our laptops. Removal is ht" [puppet] - https://gerrit.wikimedia.org/r/353013 (owner: Faidon Liambotis)
[08:15:52] addshore: if it works fine, I am willing to enable it this morning
[08:16:03] if you have bandwidth for that
[08:16:10] Awesome, yeh, just double checked and it all looks good
[08:16:32] (PS5) Addshore: wmgUseTwoColConflict true for all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/350847
[08:16:36] (CR) Addshore: [C: 1] wmgUseTwoColConflict true for all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/350847 (owner: Addshore)
[08:19:39] XioNoX: nice!
[08:19:43] "Use SMTP instead of sendmail" why? :)
[08:19:52] we generally tend to do the opposite here
[08:19:55] (CR) Muehlenhoff: [C: 2] nodepool: do not install Package[libguestfs-tools] [puppet] - https://gerrit.wikimedia.org/r/353016 (owner: Hashar)
[08:21:58] paravoid: to me it looks cleaner, but I don't really care
[08:22:51] it's not really, we run an MTA on every machine that queues emails
[08:23:00] and can use multiple outbound relays as needed
[08:23:43] so mail delivery just works™ even when one of our two mail relays is down
[08:23:52] sounds good
[08:23:56] and the local-to-the-dc mail relay is preferred
[08:23:57] hashar: or if you want me to I can just do the config switch! :)
[08:24:20] so you could in theory do the same in every software, but yeah, why :)
[08:25:00] (PS10) Giuseppe Lavagetto: restbase: migration to role/profile for the dev cluster [puppet] - https://gerrit.wikimedia.org/r/352851
[08:25:21] Operations, netops: LibreNMS improvements - https://phabricator.wikimedia.org/T164911#3250802 (ayounsi)
[08:25:34] removed from the list!
[08:26:13] (03CR) 10Jcrespo: [C: 032] "Works as intended: https://puppet-compiler.wmflabs.org/6348/" [puppet] - 10https://gerrit.wikimedia.org/r/353015 (https://phabricator.wikimedia.org/T116557) (owner: 10Jcrespo) [08:26:30] (03PS3) 10Jcrespo: Enble puppet on db2062- a one-time experiment for mariadb 10.1 [puppet] - 10https://gerrit.wikimedia.org/r/353015 (https://phabricator.wikimedia.org/T116557) [08:28:12] addshore: yes please be bold. I am around as needed. [08:28:20] ack! :) [08:28:29] addshore: and will be happy to +1 for the paper work side of things [08:29:25] PROBLEM - configured eth on db1069 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:29:26] hashar: >> https://gerrit.wikimedia.org/r/350847 go ahead :) [08:29:58] what's with db1069? is it load from pt-table-checksum? [08:30:35] PROBLEM - Disk space on ruthenium is CRITICAL: DISK CRITICAL - free space: / 1739 MB (3% inode=91%) [08:30:49] I am checking [08:30:53] (03CR) 10Hashar: [C: 032] wmgUseTwoColConflict true for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350847 (owner: 10Addshore) [08:31:03] ^that is probly apt cache and old kernels [08:31:15] RECOVERY - configured eth on db1069 is OK: OK - interfaces up [08:31:20] I don't think db1069 is suffering because of pt-table no [08:31:43] so do we have real network issues? [08:32:12] (03Merged) 10jenkins-bot: wmgUseTwoColConflict true for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350847 (owner: 10Addshore) [08:32:25] (03CR) 10jenkins-bot: wmgUseTwoColConflict true for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350847 (owner: 10Addshore) [08:32:37] I am autocleaning on ruthenium, see what we get [08:33:18] addshore: it is on mwdebug1001 / mwdebug1002 [08:34:09] hashar: all looks good! [08:34:18] addshore: is 1.30.0-wmf.1 up to date as well ? 
[08:34:23] hashar: yup [08:34:33] syncing [08:35:03] (I was only expecting the +1) ;) [08:35:24] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: wmgUseTwoColConflict true for all wikis (duration: 00m 54s) [08:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:36] !log rebooting mx2001 for update to Linux 4.9 [08:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:56] ty! [08:37:32] ruthenium is not that [08:37:37] it is /srv [08:37:48] it is taking 38 GB out of 42 [08:38:44] db1069 load is now decreasing [08:41:25] PROBLEM - swift-account-reaper on ms-be2007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:41:26] PROBLEM - swift-container-server on ms-be2007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:41:35] PROBLEM - swift-container-updater on ms-be2007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:41:45] PROBLEM - swift-object-auditor on ms-be2007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:41:45] PROBLEM - swift-object-replicator on ms-be2007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:42:17] RECOVERY - swift-container-server on ms-be2007 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [08:42:17] RECOVERY - swift-account-reaper on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [08:42:17] 06Operations, 10Icinga, 10Monitoring: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3250851 (10MoritzMuehlenhoff) p:05Triage>03High [08:42:25] RECOVERY - swift-container-updater on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [08:42:32] 06Operations: Racktables: clearly show when hosts are decommissioned - https://phabricator.wikimedia.org/T164042#3250852 (10MoritzMuehlenhoff) p:05Triage>03Normal [08:42:35] RECOVERY - swift-object-auditor on ms-be2007 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [08:42:35] RECOVERY - swift-object-replicator on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [08:44:32] 06Operations, 10IRCecho: Add flood protection to the ircecho bot (icinga-wm) - https://phabricator.wikimedia.org/T163698#3250855 (10MoritzMuehlenhoff) p:05Triage>03Low [08:46:15] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [08:46:42] 06Operations, 10DBA: Increase timeout for mariadb replication check - https://phabricator.wikimedia.org/T163303#3250856 (10Marostegui) p:05Triage>03Normal [08:47:00] 06Operations, 10Parsoid, 10VisualEditor: ruthenium is going to run out of space on /srv and stop working - https://phabricator.wikimedia.org/T164915#3250857 (10jcrespo) [08:47:39] marostegui, note that eth0 check failed, not replicatin checks [08:47:55] jynus: yes yes :) [08:48:18] But i haven't see anything on logs or the interface itself apart from some dropped 
errors an interrupt: 35 but it might be old [08:48:27] I thought the load might have made the check to fail [08:48:30] I am still checking [08:49:15] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave] [08:53:37] (03PS8) 10Volans: Replace $::main_ipaddress by the new ipaddress fact [puppet] - 10https://gerrit.wikimedia.org/r/345569 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [08:59:40] !log installing wget security updates on jessie [08:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:25] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 56 probes of 283 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [09:05:10] !log updated CI puppet compiler facts from production [09:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:40] 06Operations, 10Parsoid, 10VisualEditor: ruthenium is going to run out of space on /srv and stop working - https://phabricator.wikimedia.org/T164915#3250905 (10jcrespo) There is stuff on /dev/mapper/ruthenium--vg-tank (unmounted) ``` afwiki cuwiki enwikivoyage fiwiki hiwiki itwikivoy... [09:06:25] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 11 probes of 283 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [09:11:11] !log installing vim security updates on jessie [09:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:45] PROBLEM - puppet last run on pybal-test2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[vim] [09:13:44] RECOVERY - puppet last run on pybal-test2002 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [09:14:20] (03PS2) 10Jdlrobson: Add new Arabic Wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353012 (https://phabricator.wikimedia.org/T164648) [09:16:04] PROBLEM - puppet last run on mw2234 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[vim] [09:16:24] PROBLEM - puppet last run on mw2232 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[vim] [09:16:44] PROBLEM - puppet last run on mw2238 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[vim] [09:17:04] PROBLEM - puppet last run on mw2217 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[vim] [09:17:04] RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 221.61 seconds [09:17:52] (03PS22) 10Gehel: maps - move to role / profile [puppet] - 10https://gerrit.wikimedia.org/r/347006 [09:18:54] PROBLEM - puppet last run on mw1277 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[vim] [09:21:04] RECOVERY - puppet last run on mw2234 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [09:21:24] RECOVERY - puppet last run on mw2232 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [09:21:46] RECOVERY - puppet last run on mw2238 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [09:23:36] (03CR) 10Hashar: [C: 031] Import sources on dty.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352873 (https://phabricator.wikimedia.org/T164573) (owner: 10Sfic) [09:24:49] (03PS1) 10Muehlenhoff: Update comment [puppet] - 10https://gerrit.wikimedia.org/r/353018 [09:28:32] (03CR) 10Muehlenhoff: [C: 032] Update comment [puppet] - 10https://gerrit.wikimedia.org/r/353018 (owner: 10Muehlenhoff) [09:32:15] PROBLEM - puppet last run on ms-fe1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools] [09:32:48] that's me ^ 4.9 kernel install [09:33:54] RECOVERY - puppet last run on mw1277 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [09:36:56] !log roll-restart ms-fe2* for linux 4.9 upgrade - T162029 [09:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:04] T162029: Migrate all jessie hosts to Linux 4.9 - https://phabricator.wikimedia.org/T162029 [09:45:14] RECOVERY - puppet last run on mw2217 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [09:52:22] 06Operations, 10Pybal, 10Traffic, 10netops: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3205379 (10akosiaris) Those RSTs are the result of a packet being sent to an already closed socket. In the tcpdump pasted above, nginx sends the FIN, ACK packet and then... 
[09:54:30] (03CR) 10Urbanecm: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352873 (https://phabricator.wikimedia.org/T164573) (owner: 10Sfic) [09:54:50] MaxSem thanks and your welcome :) [09:57:20] (03CR) 10Alexandros Kosiaris: [C: 032] etcd: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352994 (owner: 10Dzahn) [09:57:26] (03PS3) 10Alexandros Kosiaris: etcd: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352994 (owner: 10Dzahn) [09:57:30] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] etcd: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352994 (owner: 10Dzahn) [10:00:15] RECOVERY - puppet last run on ms-fe1005 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [10:03:46] 06Operations, 10Traffic, 07HTTPS: wikispecies.org uses an invalid security certificate - https://phabricator.wikimedia.org/T164919#3251002 (10abian) [10:04:39] (03PS3) 10Alexandros Kosiaris: Create kubemaster.svc.$site.wmnet [dns] - 10https://gerrit.wikimedia.org/r/351836 (https://phabricator.wikimedia.org/T162040) [10:04:46] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Create kubemaster.svc.$site.wmnet [dns] - 10https://gerrit.wikimedia.org/r/351836 (https://phabricator.wikimedia.org/T162040) (owner: 10Alexandros Kosiaris) [10:06:11] 06Operations, 10Domains, 10Traffic, 10Wikimedia-Site-requests, 07HTTPS: SSL error for https://wikispecies.org/ - https://phabricator.wikimedia.org/T164868#3251019 (10abian) [10:06:13] 06Operations, 10Traffic, 07HTTPS: wikispecies.org uses an invalid security certificate - https://phabricator.wikimedia.org/T164919#3251021 (10abian) [10:10:06] (03PS4) 10Alexandros Kosiaris: lvs: Add the kubernetes master service/cluster [puppet] - 10https://gerrit.wikimedia.org/r/352580 (https://phabricator.wikimedia.org/T162040) [10:10:08] (03PS4) 10Alexandros Kosiaris: Migrate to using kubemaster.svc.$site.wmnet [puppet] - 
10https://gerrit.wikimedia.org/r/352581 [10:12:34] (03CR) 10Alexandros Kosiaris: [C: 032] Use a service cert for kubernetes masters [puppet] - 10https://gerrit.wikimedia.org/r/352860 (owner: 10Alexandros Kosiaris) [10:12:39] (03PS2) 10Alexandros Kosiaris: Use a service cert for kubernetes masters [puppet] - 10https://gerrit.wikimedia.org/r/352860 [10:12:42] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Use a service cert for kubernetes masters [puppet] - 10https://gerrit.wikimedia.org/r/352860 (owner: 10Alexandros Kosiaris) [10:15:44] PROBLEM - Check systemd state on argon is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:16:29] 06Operations, 10Parsoid, 10VisualEditor: ruthenium is going to run out of space on /srv and stop working - https://phabricator.wikimedia.org/T164915#3251039 (10jcrespo) Waiting for @ssastry because this seems to be an application problem, (took 30 gb overnight), and fixing it without knowing why it is taking... [10:17:10] (03PS1) 10Alexandros Kosiaris: Fix the kubemaster.svc.$site.wmnet key path [puppet] - 10https://gerrit.wikimedia.org/r/353030 [10:17:21] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Fix the kubemaster.svc.$site.wmnet key path [puppet] - 10https://gerrit.wikimedia.org/r/353030 (owner: 10Alexandros Kosiaris) [10:20:24] PROBLEM - puppet last run on argon is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[kube-apiserver] [10:21:44] (03PS2) 10Jcrespo: db: Comment db1015 being defective [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352188 [10:23:44] RECOVERY - Check systemd state on argon is OK: OK - running: The system is fully operational [10:27:47] (03PS1) 10Alexandros Kosiaris: Fix user/group ownership for kubernetes certs [puppet] - 10https://gerrit.wikimedia.org/r/353032 [10:28:30] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Fix user/group ownership for kubernetes certs [puppet] - 10https://gerrit.wikimedia.org/r/353032 (owner: 10Alexandros Kosiaris) [10:30:24] RECOVERY - puppet last run on argon is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [10:31:47] (03PS3) 10Jcrespo: db: Comment db1015 being defective [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352188 [10:31:49] (03PS1) 10Jcrespo: mariadb: Depool db1056 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353034 [10:33:05] (03CR) 10Marostegui: [C: 031] mariadb: Depool db1056 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353034 (owner: 10Jcrespo) [10:38:17] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1056 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353034 (owner: 10Jcrespo) [10:38:32] (03CR) 10Jcrespo: [C: 032] db: Comment db1015 being defective [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352188 (owner: 10Jcrespo) [10:38:43] (03CR) 10jenkins-bot: db: Comment db1015 being defective [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352188 (owner: 10Jcrespo) [10:40:00] (03PS5) 10Alexandros Kosiaris: lvs: Add the kubernetes master service/cluster [puppet] - 10https://gerrit.wikimedia.org/r/352580 (https://phabricator.wikimedia.org/T162040) [10:40:06] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] lvs: Add the kubernetes master service/cluster [puppet] - 10https://gerrit.wikimedia.org/r/352580 (https://phabricator.wikimedia.org/T162040) (owner: 
10Alexandros Kosiaris) [10:40:44] PROBLEM - Check systemd state on chlorine is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:41:59] (03Merged) 10jenkins-bot: mariadb: Depool db1056 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353034 (owner: 10Jcrespo) [10:42:11] (03CR) 10jenkins-bot: mariadb: Depool db1056 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353034 (owner: 10Jcrespo) [10:43:09] !log Disable replication codfw > eqiad on s7 - T147166 T130067 [10:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:18] T147166: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166 [10:43:18] T130067: Add wl_id to watchlist tables on production dbs - https://phabricator.wikimedia.org/T130067 [10:44:48] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1056 for reimage (duration: 00m 43s) [10:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:51] !log Stop replication at the same position on db1033 and db2029 - T147166 T130067 [10:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:59] T147166: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166 [10:51:00] T130067: Add wl_id to watchlist tables on production dbs - https://phabricator.wikimedia.org/T130067 [10:56:48] (03CR) 10Volans: [C: 031] "LGTM, puppet compiler results available at https://puppet-compiler.wmflabs.org/6351/" [puppet] - 10https://gerrit.wikimedia.org/r/345569 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [10:57:02] paravoid: I'm about to merge it ^^^ (FYI) [10:58:21] (03PS9) 10Volans: Replace $::main_ipaddress by the new ipaddress fact [puppet] - 10https://gerrit.wikimedia.org/r/345569 (https://phabricator.wikimedia.org/T163196) (owner: 
10Faidon Liambotis) [10:58:23] (03PS1) 10Alexandros Kosiaris: Add role::lvs::realserver to role::kubernetes::master [puppet] - 10https://gerrit.wikimedia.org/r/353039 [11:00:46] (03PS1) 10Alexandros Kosiaris: Empty keys for kubemaster.svc.$site.wmnet certs [labs/private] - 10https://gerrit.wikimedia.org/r/353041 [11:01:05] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Empty keys for kubemaster.svc.$site.wmnet certs [labs/private] - 10https://gerrit.wikimedia.org/r/353041 (owner: 10Alexandros Kosiaris) [11:02:59] (03CR) 10Volans: [C: 032] Replace $::main_ipaddress by the new ipaddress fact [puppet] - 10https://gerrit.wikimedia.org/r/345569 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [11:06:01] Did someone rename this class role::deployment::server? [11:06:08] it is saying there is no such class [11:06:17] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find class role::deployment::server for phab-tin.phabricator.eqiad.wmflabs on node phab-tin.phabricator.eqiad.wmflabs [11:06:34] paladox: I think it became deployment_server IIRC at some point [11:06:37] not sure why/when [11:06:47] Oh, i guess at 3am today [11:07:02] I got alot of notifications around 3am saying puppet errors . [11:07:06] thanks [11:07:38] ah [11:07:39] https://github.com/wikimedia/puppet/commit/5c4f02d6d9a5c3e550dcf13d314f7f1bdf88e308 [11:08:56] (03CR) 10Ema: "One comment, looks good otherwise." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/351663 (https://phabricator.wikimedia.org/T164608) (owner: 10BBlack) [11:10:15] PROBLEM - puppet last run on chlorine is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[kube-apiserver] [11:11:15] akosiaris: seems unrelated to my merge, but let me know if otherwise ^^^ [11:12:14] yes unrelated [11:14:17] (03PS2) 10Volans: lvs: replace $::ipaddress_eth0 by $::ipaddress [puppet] - 10https://gerrit.wikimedia.org/r/350765 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [11:14:43] !log Stop replication at the same position on db1050 and db2028 [11:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:53] !log Stop replication at the same position on db1049 and db2023 [11:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:24] PROBLEM - configured eth on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:25:24] PROBLEM - MD RAID on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:25:24] PROBLEM - puppet last run on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:26:14] RECOVERY - configured eth on ms-be1020 is OK: OK - interfaces up [11:26:15] RECOVERY - MD RAID on ms-be1020 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [11:26:15] RECOVERY - puppet last run on ms-be1020 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [11:27:00] !log stopping mariadb and preparing db1056 for reimage [11:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:33] (03PS1) 10DCausse: [WIP] [cirrus] Blacklist wikinews and wikiversity from cross project search on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353043 (https://phabricator.wikimedia.org/T163463) [11:40:21] (03PS3) 10Volans: lvs: replace $::ipaddress_eth0 by $::ipaddress [puppet] - 10https://gerrit.wikimedia.org/r/350765 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [11:47:04] RECOVERY - swift eqiad-prod object availability on graphite1001 is OK: OK: Less than 1.00% under the threshold [95.0] [11:50:44] (03PS1) 10Alexandros Kosiaris: Fix typo in kubemaster key filenames [labs/private] - 10https://gerrit.wikimedia.org/r/353045 [11:50:47] (03PS1) 10Alexandros Kosiaris: Fix another typo with kubernetes master keys [labs/private] - 10https://gerrit.wikimedia.org/r/353046 [11:51:46] <_joe_> incoming [11:51:58] (03PS11) 10Giuseppe Lavagetto: restbase: migration to role/profile for the dev cluster [puppet] - 10https://gerrit.wikimedia.org/r/352851 [11:52:01] (03PS1) 10Giuseppe Lavagetto: cassandra::instance: allow use of default values [puppet] - 10https://gerrit.wikimedia.org/r/353047 [11:52:03] (03PS1) 10Giuseppe Lavagetto: restbase: convert test cluster to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/353048 [11:52:05] (03PS1) 10Giuseppe Lavagetto: profile::cassandra: auto-generate fqdns for seeds [puppet] - 10https://gerrit.wikimedia.org/r/353049 [11:52:07] (03PS1) 10Giuseppe Lavagetto: restbase: convert production cluster to role/profile [puppet] - 
10https://gerrit.wikimedia.org/r/353050 [11:52:38] 06Operations, 05MW-1.30-release-notes, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#3251167 (10Gilles) In production, the migration steps also require purging, other... [11:54:46] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Fix typo in kubemaster key filenames [labs/private] - 10https://gerrit.wikimedia.org/r/353045 (owner: 10Alexandros Kosiaris) [11:54:52] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Fix another typo with kubernetes master keys [labs/private] - 10https://gerrit.wikimedia.org/r/353046 (owner: 10Alexandros Kosiaris) [11:57:33] (03PS4) 10Jcrespo: [WIP] Create scripts for batch sql execution [puppet] - 10https://gerrit.wikimedia.org/r/338809 [11:58:36] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Create scripts for batch sql execution [puppet] - 10https://gerrit.wikimedia.org/r/338809 (owner: 10Jcrespo) [12:10:48] (03PS1) 10Nschaaf: Add QuickSurvey for reader segmentation research [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353053 (https://phabricator.wikimedia.org/T131949) [12:11:34] (03CR) 10Nschaaf: [C: 04-1] "Still needs name parameter" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353053 (https://phabricator.wikimedia.org/T131949) (owner: 10Nschaaf) [12:12:00] (03PS6) 10BBlack: maps->upload functional cluster-level changes [puppet] - 10https://gerrit.wikimedia.org/r/351663 (https://phabricator.wikimedia.org/T164608) [12:12:02] (03PS2) 10BBlack: maps->upload: delete maps-specific things [puppet] - 10https://gerrit.wikimedia.org/r/352834 (https://phabricator.wikimedia.org/T164608) [12:12:04] (03PS1) 10BBlack: maps->upload: move LVS IPs [puppet] - 10https://gerrit.wikimedia.org/r/353054 (https://phabricator.wikimedia.org/T164608) [12:30:53] 06Operations, 13Patch-For-Review, 15User-fgiunchedi: Delete non-used and/or non-requested thumbnail sizes 
periodically - https://phabricator.wikimedia.org/T162796#3251296 (10fgiunchedi) >>! In T162796#3248654, @Gilles wrote: > The top 100 most requested sizes represent 91.21% of all requests. The remaining l... [12:32:27] jynus marostegui _joe_: I have something interesting to show: https://goo.gl/psVnUY. Per logstash, Wikidata has never been in read-only mode since we deployed redis dispatching lock manager [12:33:52] but it has been around 200 times in every three hours (per logstash, again) [12:34:15] !log installing logback security updates [12:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:18] jouncebot: refresh [12:45:20] I refreshed my knowledge about deployments. [12:45:23] jouncebot: next [12:45:23] In 0 hour(s) and 14 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170510T1300) [12:45:27] !log rebooting ganeti2007, ganeti2008 for networking config update [12:45:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:16] phuedx: hello, I guess your swat patch https://gerrit.wikimedia.org/r/#/c/353038/ has to be backported to 1.29.0-wmf.21 as well ? [12:47:30] !log installing irqbalance updates from jessie point update [12:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:28] Amir1, I think someone said that wikidata doesn't support mediawiki's read-only mode [12:54:53] or something along those lines, do not know the details [12:55:29] I highly doubt that [12:55:35] the idea being that extensions have to check if things are on read only, and not try to write or they will fail [12:55:53] Amir1, it was not me who said it, so I do not know the details [12:56:07] I practically touched every part of the wikibase code and never seen such thing [12:56:08] of what someone meant with that, etc. 
[12:56:25] gilles: your metadata patch at https://gerrit.wikimedia.org/r/#/c/353042/ are you sure you want to push it to prod today ? [12:56:39] probably, I ask Daniel [12:56:54] I may be reminding things wrongly [12:56:57] hashar: yes [12:56:59] (03PS1) 10Marostegui: db-eqiad.php: Depool db1097 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353057 [12:57:04] so do not take may word for granted [12:57:04] (03PS3) 10Hashar: Import sources on dty.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352873 (https://phabricator.wikimedia.org/T164573) (owner: 10Sfic) [12:57:30] jouncebot: next [12:57:30] In 0 hour(s) and 2 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170510T1300) [12:57:39] Amir1, what I know for sure is that some extensions do not support it [12:57:56] because we saw some tries that error with a db error [12:58:12] (which we on purpose also make read only to avoid issues) [12:58:32] (03PS2) 10Hashar: Disable page previews beta features on various projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353011 (https://phabricator.wikimedia.org/T164740) (owner: 10Jdlrobson) [12:58:34] (03PS3) 10Hashar: Add new Arabic Wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353012 (https://phabricator.wikimedia.org/T164648) (owner: 10Jdlrobson) [12:58:36] (03PS3) 10Hashar: Clean up inappropriate usages of wmg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351922 (https://phabricator.wikimedia.org/T151891) (owner: 10Jdlrobson) [13:00:02] Yeah, I'd say handling should be more graceful most of the times [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170510T1300). Please do the needful. 
[13:00:04] Sfic, Jdlrobson, phuedx, and gilles: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:13] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352873 (https://phabricator.wikimedia.org/T164573) (owner: 10Sfic) [13:00:26] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353011 (https://phabricator.wikimedia.org/T164740) (owner: 10Jdlrobson) [13:00:37] Hello [13:00:59] there is also another related issue, and it is related to T151681 [13:01:00] T151681: DispatchChanges: Avoid long-lasting connections to the master DB - https://phabricator.wikimedia.org/T151681 [13:01:17] (03Merged) 10jenkins-bot: Import sources on dty.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352873 (https://phabricator.wikimedia.org/T164573) (owner: 10Sfic) [13:01:28] (03CR) 10jenkins-bot: Import sources on dty.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352873 (https://phabricator.wikimedia.org/T164573) (owner: 10Sfic) [13:01:36] because of the current model, long runnning threads do not have its configuration updated [13:01:36] (03Merged) 10jenkins-bot: Disable page previews beta features on various projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353011 (https://phabricator.wikimedia.org/T164740) (owner: 10Jdlrobson) [13:01:54] which means all kind of issues [13:02:00] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353012 (https://phabricator.wikimedia.org/T164648) (owner: 10Jdlrobson) [13:02:19] not a long time ago, hashar and me saw a long running maintenance that had been running for days with an oudated codebase [13:02:28] *task [13:02:55] jdlrobson: going to deploy your patches soonish [13:02:56] (03Merged) 10jenkins-bot: Add new Arabic Wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353012 
(https://phabricator.wikimedia.org/T164648) (owner: 10Jdlrobson) [13:03:35] (03CR) 10jenkins-bot: Disable page previews beta features on various projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353011 (https://phabricator.wikimedia.org/T164740) (owner: 10Jdlrobson) [13:03:37] (03CR) 10jenkins-bot: Add new Arabic Wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353012 (https://phabricator.wikimedia.org/T164648) (owner: 10Jdlrobson) [13:03:43] (03PS2) 10Alexandros Kosiaris: Add role::lvs::realserver to role::kubernetes::master [puppet] - 10https://gerrit.wikimedia.org/r/353039 [13:03:47] (03CR) 10Alexandros Kosiaris: [C: 032] Add role::lvs::realserver to role::kubernetes::master [puppet] - 10https://gerrit.wikimedia.org/r/353039 (owner: 10Alexandros Kosiaris) [13:03:47] hashar: awesome [13:03:49] maybe we should avoid those harder or have a reloadconfig() method, but that is mediawiki [13:03:50] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add role::lvs::realserver to role::kubernetes::master [puppet] - 10https://gerrit.wikimedia.org/r/353039 (owner: 10Alexandros Kosiaris) [13:03:51] im reading to test when you are [13:03:53] o/ [13:04:01] sorry i'm a little late [13:04:02] I am pushing them one by one [13:04:14] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Import sources on dty.wikipedia - T164573 (duration: 00m 43s) [13:04:14] to mwdebug1002 [13:04:14] ? [13:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:22] T164573: Enable Import feature in Doteli Wikipedia - https://phabricator.wikimedia.org/T164573 [13:04:37] phuedx: you probably want to backport your change https://gerrit.wikimedia.org/r/#/c/353038/ to the 1.29.0-wmf.21 branch dont you? 
[13:05:05] hashar: i can backport to that branch as well [13:05:05] sec [13:05:10] phuedx: since 1.30.0-wmf.1 is only on group0 right now ( http://tools.wmflabs.org/versions/ ) [13:05:39] hashar: If I'm able to get this task done real quick could I add a patch to EU swat? [13:05:44] jdlrobson: I am syncing the change to disable page previews on all wikis [13:05:49] sweet [13:06:50] both "disable page previews" and "arabic logo" are on mwdebug1001 / mwdebug1002 [13:06:59] hashar: https://gerrit.wikimedia.org/r/#/c/353058/ [13:07:32] phuedx: hopefully it is valid :-} [13:07:44] both can be deployed hashar [13:07:45] RECOVERY - Check systemd state on chlorine is OK: OK - running: The system is fully operational [13:07:50] !log Run pt-table-checksum on s7.hewiki - https://phabricator.wikimedia.org/T163190 [13:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:04] (03CR) 10Volans: [C: 031] "LGTM, noop on puppet compiler, see" [puppet] - 10https://gerrit.wikimedia.org/r/350765 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [13:08:15] (03Draft2) 10Zppix: Correct alias(es) from es.wikisource to eo.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353059 (https://phabricator.wikimedia.org/T164888) [13:08:25] RECOVERY - puppet last run on chlorine is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [13:08:40] hashar can I add gerrit:353059 to swat? 
[13:09:13] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Add new Arabic Wikipedia logo - T164648 && Disable page previews beta features on various projects - T164740 (duration: 00m 42s) [13:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:21] T164648: Use the correct Arabic Wikipedia wordmark on mobile site - https://phabricator.wikimedia.org/T164648 [13:09:21] T164740: Disable page previews on projects where implementation isn't optimal - https://phabricator.wikimedia.org/T164740 [13:09:39] jdlrobson: https://gerrit.wikimedia.org/r/#/c/351922/3/wmf-config/InitialiseSettings.php you are dropping wmgMFEditorOptions [13:09:47] jdlrobson: but I guess it is not needed anymore ? [13:10:19] hashar: not needed. this is an artifact from when we rolled out anonymous editing [13:10:34] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351922 (https://phabricator.wikimedia.org/T151891) (owner: 10Jdlrobson) [13:10:45] also not doing anything :) as no wg equivalent [13:11:32] phuedx: popups patch is on 1.30.0-wmf.1 wikis [13:11:45] phuedx: err on 1.30.0-wmf.1 wikis but solely on mwdebug1001 / mwdebug1002 for now [13:11:52] hashar: did you sync the arabic wrdmark? [13:11:57] * phuedx wipes the sweat of his brow [13:12:00] hashar: scary ;) [13:12:13] (03CR) 10Ema: [C: 031] "Looks good to me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/351663 (https://phabricator.wikimedia.org/T164608) (owner: 10BBlack) [13:12:26] jdlrobson: I forgot to sync the svg :( [13:12:36] hashar: cool https://ar.m.wikipedia.org/static/images/mobile/copyright/wikipedia-wordmark-ar.svg was 404ing [13:12:36] !log restart pybal on lvs1006, lvs1009, lvs1012 to pick up the kubemaster LVS service [13:12:41] which was scaring me :) [13:12:43] (03Merged) 10jenkins-bot: Clean up inappropriate usages of wmg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351922 (https://phabricator.wikimedia.org/T151891) (owner: 10Jdlrobson) [13:12:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:54] (03CR) 10jenkins-bot: Clean up inappropriate usages of wmg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351922 (https://phabricator.wikimedia.org/T151891) (owner: 10Jdlrobson) [13:12:54] syncing it now [13:13:04] !log hashar@tin Synchronized static/images/mobile/copyright/wikipedia-wordmark-ar.svg: Add new Arabic Wikipedia logo - T164648 (duration: 00m 44s) [13:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:26] I'm sorry if i was already answered (i accidently cleared my backlog) but can I have https://gerrit.wikimedia.org/r/#/c/353059/ added to today's eu swat? [13:13:58] jdlrobson: https://ar.m.wikipedia.org/static/images/mobile/copyright/wikipedia-wordmark-ar.svg still yields a 404 :( [13:14:24] (03PS3) 10Fdans: Add bot filter to mysql consumer [puppet] - 10https://gerrit.wikimedia.org/r/352582 (https://phabricator.wikimedia.org/T67508) [13:14:39] hashar: strangely it doesnt for debug1002 [13:14:47] do you need to touch the file? 
[13:14:59] https://ar.m.wikipedia.org/static/images/mobile/copyright/wikipedia-wordmark-ar.svg?r=3 [13:15:03] cache busting makes it show [13:15:07] maybe it needs to propagate [13:15:07] (03CR) 10jerkins-bot: [V: 04-1] Add bot filter to mysql consumer [puppet] - 10https://gerrit.wikimedia.org/r/352582 (https://phabricator.wikimedia.org/T67508) (owner: 10Fdans) [13:16:43] hashar: 👍 [13:16:54] jdlrobson: touched it and syncing again [13:17:03] jdlrobson: I have no idea how static stuff gets cached though [13:17:15] !log hashar@tin Synchronized static/images/mobile/copyright/wikipedia-wordmark-ar.svg: (no justification provided) (duration: 00m 42s) [13:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:33] phuedx: ok syncing [13:17:40] hashar: lgtm, did a general prod of page previews on testwiki against mwdebug1002 and saw no errors in the console [13:18:50] hashar: https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Image_Cache_Purges [13:19:10] hashar: I think my irc bouncer is laggy right now, bear with me if I don't respond immediately [13:19:12] hashar: does that help ^ ? [13:19:21] !log hashar@tin Synchronized php-1.30.0-wmf.1/extensions/Popups: eventLogging: Discard events with duplicate tokens - T161769 T163198 (duration: 01m 08s) [13:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:30] T161769: Schema:Popups sends extraneous link interaction events in control condition - https://phabricator.wikimedia.org/T161769 [13:19:32] T163198: Track instances of duplicate Popups events being logged - https://phabricator.wikimedia.org/T163198 [13:19:32] phuedx: the 1.29.0-wmf.21 version of your Popups change is now on 1.29.0-wmf.21 mwdebug hosts [13:19:39] jdlrobson: yup :) [13:19:46] ah I did a nspurge [13:19:50] hashar: <3 [13:19:56] testing now [13:20:15] (03CR) 10Zppix: "@hashar can we add this to 5/10's EU swat?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/353059 (https://phabricator.wikimedia.org/T164888) (owner: 10Zppix) [13:21:33] jdlrobson: looks good now [13:21:42] hashar: I can see the static image now as well [13:22:03] jdlrobson: I have purged a combination of (--wiki=enwiki,arwiki) (en.wikipedia.org|en.m.wikipedia.org|ar.wikipedia.org|ar.m.wikipedia.org) [13:22:09] one of the combo apparently managed to purge something [13:22:16] Zppix: :) [13:22:38] hashar: same again: lgtm [13:22:47] (03PS1) 10Marostegui: db-eqiad.php: Depool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353061 (https://phabricator.wikimedia.org/T130067) [13:22:50] tested on hewiki on mwdebug1002 [13:22:51] (03PS4) 10Fdans: Add bot filter to mysql consumer [puppet] - 10https://gerrit.wikimedia.org/r/352582 (https://phabricator.wikimedia.org/T67508) [13:23:14] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [13:23:26] (03PS23) 10Gehel: maps - move to role / profile [puppet] - 10https://gerrit.wikimedia.org/r/347006 [13:23:41] yay [13:23:42] hashar: it works [13:23:44] phuedx: syncing it [13:23:46] jdlrobson: \O/ [13:23:56] thanks. does that mean all my changes are synced? [13:24:06] jdlrobson: note: I have no clue what that arabic sentence means :} [13:24:10] (03CR) 10jerkins-bot: [V: 04-1] Add bot filter to mysql consumer [puppet] - 10https://gerrit.wikimedia.org/r/352582 (https://phabricator.wikimedia.org/T67508) (owner: 10Fdans) [13:24:16] jdlrobson: yeah all synced. [13:24:17] me neither. 
Hopefully Nirzar is not trolling me ;-) [13:24:21] !log hashar@tin Synchronized php-1.29.0-wmf.21/extensions/Popups: eventLogging: Discard events with duplicate tokens - T161769 T163198 (duration: 00m 43s) [13:24:25] phuedx: done :-) [13:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:30] T161769: Schema:Popups sends extraneous link interaction events in control condition - https://phabricator.wikimedia.org/T161769 [13:24:30] T163198: Track instances of duplicate Popups events being logged - https://phabricator.wikimedia.org/T163198 [13:24:34] hashar: thanks [13:24:38] (03PS1) 10Jcrespo: mariadb-install_server: Allow temporary full reimage of db1056 [puppet] - 10https://gerrit.wikimedia.org/r/353063 [13:26:19] (03PS1) 10Ayounsi: Add new logstash LVS service based on the listeners listed in modules/role/manifests/logstash/collector.pp Inspired by https://gerrit.wikimedia.org/r/#/c/324371/ [puppet] - 10https://gerrit.wikimedia.org/r/353064 (https://phabricator.wikimedia.org/T151971) [13:26:24] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [13:26:25] PROBLEM - Unmerged changes on repository mediawiki_config on naos is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [13:26:34] PROBLEM - Host ganeti2006 is DOWN: PING CRITICAL - Packet loss = 100% [13:26:54] PROBLEM - Host ganeti2005 is DOWN: PING CRITICAL - Packet loss = 100% [13:27:14] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. 
[13:27:14] RECOVERY - Host ganeti2006 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms [13:27:14] RECOVERY - Host ganeti2005 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [13:27:21] jdlrobson: sorry there is still the cleanup change to deploy [13:27:43] jdlrobson: pulled it on mwdebug hosts [13:27:56] hashar: on it [13:28:01] !log Disable replication codfw > eqiad on s1 - https://phabricator.wikimedia.org/T147166 https://phabricator.wikimedia.org/T130067 [13:28:08] thanks to icinga-wm :) [13:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:02] hashar: looks good to me [13:29:08] logs are clear? [13:29:19] looks like [13:29:42] sweet [13:30:49] going to sync to prod [13:32:24] RECOVERY - Unmerged changes on repository mediawiki_config on naos is OK: No changes to merge. [13:32:24] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [13:33:46] 06Operations, 05MW-1.30-release-notes, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#3251491 (10Gilles) Got confirmation that varnish entries for originals normally e... 
[13:33:52] (03CR) 10BBlack: [C: 04-1] Add new logstash LVS service based on the listeners listed in modules/role/manifests/logstash/collector.pp Inspired by https://gerrit.wikime (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/353064 (https://phabricator.wikimedia.org/T151971) (owner: 10Ayounsi) [13:34:20] 06Operations, 05MW-1.30-release-notes, 06Performance-Team, 10Thumbor: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#3251494 (10Gilles) [13:34:41] !log hashar@tin Synchronized wmf-config/CommonSettings.php: Clean up inappropriate usages of wmg - T151891 (duration: 00m 42s) [13:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:49] T151891: Clean up inappropriate usages of wmg prefix in Reading-web maintained extensions - https://phabricator.wikimedia.org/T151891 [13:35:42] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Clean up inappropriate usages of wmg - T151891 (duration: 00m 42s) [13:35:43] jdlrobson: synced [13:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:12] (03PS5) 10Fdans: Add bot filter to mysql consumer [puppet] - 10https://gerrit.wikimedia.org/r/352582 (https://phabricator.wikimedia.org/T67508) [13:36:33] (03PS7) 10BBlack: maps->upload functional cluster-level changes [puppet] - 10https://gerrit.wikimedia.org/r/351663 (https://phabricator.wikimedia.org/T164608) [13:36:37] Last patch standing is gilles one to mediawiki/core https://gerrit.wikimedia.org/r/#/c/353042/ [13:36:58] (03CR) 10jerkins-bot: [V: 04-1] Add bot filter to mysql consumer [puppet] - 10https://gerrit.wikimedia.org/r/352582 (https://phabricator.wikimedia.org/T67508) (owner: 10Fdans) [13:37:20] (03PS1) 10Ema: cache_upload VTC tests: update to reflect the <1K exception [puppet] - 10https://gerrit.wikimedia.org/r/353066 [13:37:22] gilles: i would rather keeps that one on beta cluster for now and 
push it as part of train next week or in a standalone window? [13:37:50] (03CR) 10BBlack: [C: 032] maps->upload functional cluster-level changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/351663 (https://phabricator.wikimedia.org/T164608) (owner: 10BBlack) [13:37:58] hashar: if possible can we also do https://gerrit.wikimedia.org/r/#/c/353059 if its not too much to ask [13:38:19] hashar: why? it's well tested [13:38:35] I need to get moving on that, I believe there's no train next week [13:38:45] ah yeah offsides / hackathon etc [13:39:05] I guess if something fails that is solely on testwiki for now [13:39:34] PROBLEM - Host ganeti2005 is DOWN: PING CRITICAL - Packet loss = 100% [13:39:37] don't worry I'm going to spend the next 2 hours testing it every way I can on testwiki and mediawiki.org [13:39:51] on all file types, etc. [13:39:54] RECOVERY - Host ganeti2005 is UP: PING OK - Packet loss = 0%, RTA = 0.96 ms [13:40:04] PROBLEM - Host ganeti2006 is DOWN: PING CRITICAL - Packet loss = 100% [13:40:24] gilles: beware of the max header size on swift end. I can well imagine a document having lot of different pages in different sizes that would end up failling [13:40:34] RECOVERY - Host ganeti2006 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [13:40:39] gilles: pushing :) [13:40:52] (03PS2) 10Ayounsi: Add new logstash LVS service based on the listeners listed in modules/role/manifests/logstash/collector.pp Inspired by https://gerrit.wikimedia.org/r/#/c/324371/ [puppet] - 10https://gerrit.wikimedia.org/r/353064 (https://phabricator.wikimedia.org/T151971) [13:41:00] hashar: I've tested it on documents with hundreds of different pages, all of different sized and it was still well under the limit [13:41:04] PROBLEM - puppet last run on cp2020 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:41:09] it compresses the dimension information somewhat [13:41:31] I'm sure we'll find some files that go over the limit, but it's not a big deal [13:41:47] the contingency plan if there are many is to limit by the smallest page in the document [13:42:00] I mean if there are so many that exceed the limit that it's a problem [13:42:07] (03CR) 10Alexandros Kosiaris: [C: 032] "https://puppet-compiler.wmflabs.org/6366/kubernetes1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/352581 (owner: 10Alexandros Kosiaris) [13:42:09] (03CR) 10Ema: "We can get rid of modules/varnish/files/tests/maps/ too." [puppet] - 10https://gerrit.wikimedia.org/r/352834 (https://phabricator.wikimedia.org/T164608) (owner: 10BBlack) [13:42:17] (03PS5) 10Alexandros Kosiaris: Migrate to using kubemaster.svc.$site.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/352581 [13:42:30] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Migrate to using kubemaster.svc.$site.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/352581 (owner: 10Alexandros Kosiaris) [13:47:38] gilles: also keep in mind some files will have the extra headers due to 1.30.0-wmf.1 but would still be access by 1.29.0-wmf.21 which does not have support for that header. But I guess it will just ignore it :D [13:48:02] gilles: should we pass via mwdebug1001 or do we go straight to prod ? 
[13:48:12] (03PS6) 10Fdans: Add bot filter to mysql consumer [puppet] - 10https://gerrit.wikimedia.org/r/352582 (https://phabricator.wikimedia.org/T67508) [13:49:09] RECOVERY - puppet last run on cp2020 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [13:50:10] hashar: nothing consumes the header right now [13:50:10] hashar: mwdebug1001 [13:50:10] I'll try uploading a file on testwiki first, mwdebug1001 should be enough for that [13:50:10] once it's in prod it's going to take me a while to do exhaustive testing, but if uploading works there's little risk of a massive fuck-up [13:50:21] ok [13:51:16] hashar: Can I deploy: https://gerrit.wikimedia.org/r/#/c/353061/ or should I wait? (There is no rush) [13:52:04] (03PS12) 10Giuseppe Lavagetto: restbase: migration to role/profile for the dev cluster [puppet] - 10https://gerrit.wikimedia.org/r/352851 [13:52:06] (03PS2) 10Giuseppe Lavagetto: cassandra::instance: allow use of default values [puppet] - 10https://gerrit.wikimedia.org/r/353047 [13:52:08] (03PS2) 10Giuseppe Lavagetto: restbase: convert test cluster to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/353048 [13:52:10] (03PS2) 10Giuseppe Lavagetto: profile::cassandra: auto-generate fqdns for seeds [puppet] - 10https://gerrit.wikimedia.org/r/353049 [13:52:12] (03PS2) 10Giuseppe Lavagetto: restbase: convert production cluster to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/353050 [13:52:21] the maintenance scripts for testwiki to migrate data took 10 minutes or so to run, after manually testing uploading different things, I'll migrate group0 [13:52:22] and if all goes well, group1 after tonight's train, etc. 
[13:53:14] !log reboot kafka200[23] for kernel upgrades (kafka main-codfw cluster, eventbus codfw) [13:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:23] gilles: let's coordinate once you get to big wikis, commons of course being one [13:53:36] gilles: ok pulled on mwdebug1001 [13:53:40] godog: yeah, that would be tomorrow afaik [13:53:42] there was another wiki that itself wasn't big but had a ton of uploads or thumbs [13:53:55] hashar: thanks, testing [13:55:15] (03PS1) 10Gehel: maps - renamed cassandra passwords for role / profile refactoring [labs/private] - 10https://gerrit.wikimedia.org/r/353068 [13:55:36] gilles: ok, how big is the maintenance script batch of files btw? [13:55:47] all files [13:55:54] batches are 200 at once [13:56:33] hashar: not seeing the header on a freshly upload file, but I wonder if the upload request would correctly go to mwdebug1001 [13:56:45] (03PS1) 10BBlack: maps->upload: fix kartotherian be_opts [puppet] - 10https://gerrit.wikimedia.org/r/353069 (https://phabricator.wikimedia.org/T164608) [13:57:14] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack and cable frlog1001 - https://phabricator.wikimedia.org/T163127#3251555 (10Jgreen) I had no trouble with dhcp/pxe/tftp install and it's up and running now, did that resolve the interface flap situation? [13:57:30] (03CR) 10BBlack: [V: 032 C: 032] maps->upload: fix kartotherian be_opts [puppet] - 10https://gerrit.wikimedia.org/r/353069 (https://phabricator.wikimedia.org/T164608) (owner: 10BBlack) [13:57:34] 06Operations, 10netops, 13Patch-For-Review: analytics hosts frequently tripping 'port utilization threshold' librenms alerts - https://phabricator.wikimedia.org/T133852#3251556 (10ayounsi) 05Resolved>03Open Today the analytics hosts saturated their uplinks for about 2h, so that goes beyond a reasonable t... 
[13:57:51] gilles: ack, thanks, I'll keep an eye on swift when commons happens [13:59:21] hashar: I don't see the latest git hash in Special:Version on testwiki, with the browser extension pointing to mwdebug1001, it points to: https://phabricator.wikimedia.org/rMW9774ce61eb050ffbdcfb7fea48dbd9e2bcb45d3c [13:59:59] is that supposed to point to the right thing when you push a change this way? [14:00:04] maybe not [14:00:48] it's always confusing to check from the website's perspective if the patch was applied when you don't see the effect it's supposed to have :) [14:00:51] gilles: yeah the git info cache has not been refreshed [14:02:13] gilles: on mwdebug1001 the release notes file does mention X-Content-Dimensions header [14:02:17] the x-wikimedia-debug is in the request header, should do what it's supposed to... [14:02:32] of the upload request [14:02:58] (03PS2) 10Ema: cache_upload VTC tests: update to reflect the <1K exception [puppet] - 10https://gerrit.wikimedia.org/r/353066 [14:03:06] (03CR) 10Ema: [V: 032 C: 032] cache_upload VTC tests: update to reflect the <1K exception [puppet] - 10https://gerrit.wikimedia.org/r/353066 (owner: 10Ema) [14:03:12] let me scap pull again [14:03:36] did [14:04:02] jynus, mutante .. .that explains the odd failures i am seeing with the test run ... i can stop the service but puppet will restart it again. those are only test runs. [14:04:08] reg T164915 [14:04:09] T164915: ruthenium is going to run out of space on /srv and stop working - https://phabricator.wikimedia.org/T164915 [14:04:34] gilles: want me to pull it on mwdebug1002 as well? 
[14:04:36] hashar: push it to prod, I'll be more certain that it's running in the upload codepath [14:04:45] ok [14:04:48] right now it's at worst harmless, I can upload fine [14:05:35] 06Operations, 05Goal, 13Patch-For-Review, 07kubernetes: Eliminate SPOFs in the existing eqiad kubernetes infrastructure - https://phabricator.wikimedia.org/T162040#3251578 (10akosiaris) The single master (with a manual override) SPOF has been addressed. We now have 2 masters in eqiad behind an LVS service... [14:05:51] 06Operations, 05Goal, 13Patch-For-Review, 07kubernetes: Eliminate SPOFs in the existing eqiad kubernetes infrastructure - https://phabricator.wikimedia.org/T162040#3251579 (10akosiaris) [14:07:46] 06Operations, 10Parsoid, 10VisualEditor: ruthenium is going to run out of space on /srv and stop working - https://phabricator.wikimedia.org/T164915#3251606 (10ssastry) /dev/mapper/ruthenium--vg-tank was supposed to be mounted at /srv/visualdiff/pngs ... i suppose on a recent reboot, it didn't get mounted ..... [14:08:40] 14:06:59 Running command: `find -O2 '/srv/mediawiki-staging/php-1.30.0-wmf.1' -not -type d -name '*.php' -not -name 'autoload_static.php' -or -name '*.inc' | xargs -n1 -P12 -exec php -l >/dev/null` [14:08:41] .... [14:08:48] that is surely going to take a while [14:09:03] hashar: is swat completed? [14:09:13] almost [14:09:22] ok [14:09:53] if upload works fine we can consider the swat over, it doesn't matter if the changeset didn't work [14:10:04] investigating it one way or the other will take a while anyway [14:13:06] 06Operations, 10Parsoid: ruthenium is going to run out of space on /srv and stop working - https://phabricator.wikimedia.org/T164915#3251655 (10ssastry) [14:13:32] pfff [14:13:34] !log ValueError: /srv/mediawiki-staging/php-1.30.0-wmf.1/extensions/Collection/.eslintrc.json is an invalid JSON file [14:13:34] ... 
[14:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:30] !log hashar@tin Started scap: (no justification provided) [14:15:30] !log hashar@tin scap aborted: (no justification provided) (duration: 00m 00s) [14:15:36] !log hashar@tin Started scap: Store original media dimensions as additional header - T150741 [14:15:37] !log hashar@tin scap aborted: Store original media dimensions as additional header - T150741 (duration: 00m 00s) [14:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:41] !log hashar@tin Started scap: Store original media dimensions as additional header - T150741 [14:15:43] grr [14:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:51] T150741: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741 [14:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:32] gilles: I end up running a full scap [14:18:33] I think I've just realized what's likely wrong [14:18:42] we have a bit in puppet that whitelists the headers swift stores [14:19:21] forgot about that. 
anyway we'll test that upload still works [14:19:30] (03PS1) 10BBlack: maps->upload: keep the upload sec-related headers upload-only [puppet] - 10https://gerrit.wikimedia.org/r/353073 (https://phabricator.wikimedia.org/T164608) [14:19:33] mediawiki probably correctly sends the header and swift ignores it [14:19:34] !log hashar@tin Finished scap: Store original media dimensions as additional header - T150741 (duration: 03m 53s) [14:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:02] (03CR) 10BBlack: [V: 032 C: 032] maps->upload: keep the upload sec-related headers upload-only [puppet] - 10https://gerrit.wikimedia.org/r/353073 (https://phabricator.wikimedia.org/T164608) (owner: 10BBlack) [14:20:03] gilles: done probably. [14:21:33] hashar: upload works fine, I'll schedule what's missing for the next SWAT window. thanks! [14:21:35] (03PS1) 10Giuseppe Lavagetto: service::node: report timing data from check-service to statsd [puppet] - 10https://gerrit.wikimedia.org/r/353075 [14:21:52] !log upgrading mw1263-mw1265 to latest HHVM package (including the redis QUIT patch) [14:21:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:31] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3251689 (10Papaul) @RobH let me know when you want to start working on this. Next week works for me. [14:24:28] hashar: Can I deploy wmf-config/db-eqiad.php then? [14:25:25] marostegui: yes should be good [14:25:33] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/6369/scb1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/353075 (owner: 10Giuseppe Lavagetto) [14:25:38] thanks! 
[14:25:38] marostegui: we are acting on mediawiki/core so a patch to db.php should be all fine ) [14:25:39] (03PS2) 10Giuseppe Lavagetto: service::node: report timing data from check-service to statsd [puppet] - 10https://gerrit.wikimedia.org/r/353075 [14:25:43] (03PS2) 10Marostegui: db-eqiad.php: Depool db1097 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353057 [14:25:48] gilles: well done :-} [14:26:35] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] service::node: report timing data from check-service to statsd [puppet] - 10https://gerrit.wikimedia.org/r/353075 (owner: 10Giuseppe Lavagetto) [14:27:10] !log European SWAT completed [14:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:31] (03PS1) 10Gilles: Whitelist X-Content-Dimensions in swift [puppet] - 10https://gerrit.wikimedia.org/r/353078 (https://phabricator.wikimedia.org/T150741) [14:29:20] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1097 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353057 (owner: 10Marostegui) [14:31:07] akosiaris: 6ca0bf12e24b5efb949da0abed8cb12c9e21601c broke puppet on tools k8s workers Could not find data item profile::kubernetes::master_fqdn in any Hiera data file and no default supplied at /etc/puppet/modules/profile/manifests/kubernetes/node.pp [14:31:49] 06Operations, 10DBA, 10Monitoring: Create a check/calendar alert for MariaDB TLS certs - https://phabricator.wikimedia.org/T152427#3251708 (10jcrespo) I have created the foundations for this: https://phabricator.wikimedia.org/P5395#29087 [14:32:32] subbu, I can disable puppet [14:32:45] but there are several services running there [14:33:05] (03CR) 10BBlack: [C: 031] lvs: remove support for <= trusty [puppet] - 10https://gerrit.wikimedia.org/r/350769 (owner: 10Faidon Liambotis) [14:33:05] jynus it is fine ... i fixed it so that even if it starts up, it won't start up a new test run. 
[14:33:07] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] Whitelist X-Content-Dimensions in swift [puppet] - 10https://gerrit.wikimedia.org/r/353078 (https://phabricator.wikimedia.org/T150741) (owner: 10Gilles) [14:33:16] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1097 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353057 (owner: 10Marostegui) [14:33:30] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1097 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353057 (owner: 10Marostegui) [14:33:35] what I mean is, can we stop all of them, do you know about the other dirs I mention, are they the same service? [14:34:07] yes, you can stop all services, no problem .. [14:34:18] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1097 (duration: 00m 43s) [14:34:20] are you talking about dirs in /srv/ or dirs in the unmapped volume? [14:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:30] *unmounted [14:34:35] the dirs on /srv [14:34:50] (03PS4) 10Volans: lvs: replace $::ipaddress_eth0 by $::ipaddress [puppet] - 10https://gerrit.wikimedia.org/r/350765 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [14:35:07] I do not know what is there on the volume- I will move them to old [14:35:12] and check them later [14:35:16] they are all needed .. but the contents in /srv/visualdiff/pngs can be emptied out .. T164915#3251606 [14:35:16] T164915: ruthenium is going to run out of space on /srv and stop working - https://phabricator.wikimedia.org/T164915 [14:35:19] they looked like images or somethjig [14:35:28] jynus, they are images yes ...created by the test run. [14:35:37] but that is not accesible [14:35:38] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1097 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353080 [14:35:40] it was umounted [14:35:44] volans: I'm checking that one right quick [14:35:45] yes, it needs to be mounted. 
[14:35:49] mutante, had previously mounted it .. [14:35:51] 06Operations, 05codfw-rollout: url-downloader should be set up more redundantly - https://phabricator.wikimedia.org/T122134#3251714 (10akosiaris) [14:35:54] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: ganeti2007-ganeti2008 racking and onsite setup task - https://phabricator.wikimedia.org/T164011#3251712 (10akosiaris) 05Open>03Resolved ganeti2007, ganeti2008 are installed, fully updated (along with ganeti2005, ganeti2006) and part of the cluster. I... [14:35:56] but i suppose it wasn't puppetized. [14:36:00] so did this restart or something? [14:36:01] so on a reboot, it went unmounted again. [14:36:04] ok [14:36:05] volans: (the puppet compiler runs didn't check lvs?) [14:36:07] so that is the issue [14:36:15] ssastry@ruthenium:/srv/visualdiff/testreduce$ uptime [14:36:15] 14:36:08 up 14 days, 21:12, 1 user, load average: 0.10, 0.07, 0.34 [14:36:19] where do I mount the enwiki stuf? [14:36:19] so, yes, rebooted 14 days ago. 
[14:36:28] volans: oh ignore me, they did heh [14:36:33] bblack: yes is in https://puppet-compiler.wmflabs.org/6361/ [14:36:35] noop [14:36:37] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe1005.eqiad.wmnet [14:36:41] but sure double check :) [14:36:41] jynus mount /dev/mapper/ruthenium--vg-tank at /srv/visualdiff/pngs [14:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:45] more eyes the better [14:36:50] because now we will have maybe duplicate content [14:36:51] !log roll-restart swift-proxy to apply https://gerrit.wikimedia.org/r/#/c/353078/ [14:36:53] you can delete everything in /srv/visualdiff/pngs [14:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:00] this is what I am going to do [14:37:06] (03CR) 10BBlack: [C: 031] lvs: replace $::ipaddress_eth0 by $::ipaddress [puppet] - 10https://gerrit.wikimedia.org/r/350765 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [14:37:09] mount the hidden partiion on /mnt [14:37:31] backup /srv/visualdiff/pngs [14:37:36] jynus, no need to backup. [14:37:36] bblack: thanks, good to merge for you? [14:37:45] it will regenerate in 8-12 hours .. it is just test run data. [14:37:49] chasemp: ah, indeed. I guess I should add a look up for that value. 
I 'll add it in hieradata/labs/tools/ in the puppet repo, I assume that's the correct one [14:37:51] ok, then [14:38:06] volans: yup [14:38:11] ok, proceeding [14:38:12] but I am not sure if that will solve fully tthe problem, subbu [14:38:17] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1097 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353080 (owner: 10Marostegui) [14:38:18] let me see [14:38:24] (03CR) 10Volans: [C: 032] lvs: replace $::ipaddress_eth0 by $::ipaddress [puppet] - 10https://gerrit.wikimedia.org/r/350765 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [14:38:30] volans: if you're checking compiler on all of those, as long as lvs/cp/dns come back as no-op where appropriate I'm good :) [14:38:40] jynus, mounting it at /srv/visualdiff/pngs will solve it now ... puppetizing that moujnt will ensure that this wont repeat on reboots. [14:38:48] :) [14:38:53] akosiaris: tbh tools and esp k8s hiera is an ungodly mess but that's as fine a place as any of the 3 candidates if it works [14:39:01] k8s hiera for tools I mean [14:39:18] heh [14:39:18] ok [14:39:41] wait, what the 3rd candidate ? [14:39:45] !log disabling puppet to solve disk mount issues T164915 [14:39:50] it's hieradata, and horizon, right ? [14:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:05] wikitech hiera is disabled, right ? 
[14:40:17] akosiaris: I don't believe it is [14:40:21] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1097 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/353080 (owner: Marostegui) [14:40:22] ah [14:40:24] hmm [14:40:30] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1097 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/353080 (owner: Marostegui) [14:40:34] yuvi did something in that direction but to my knowledge he didn't wrap it up [14:40:49] that's a tender spot for us atm as everyone hates it etc [14:41:16] (PS3) Volans: dnsrecursor: use ipaddress6, not ipaddress6_eth0 [puppet] - https://gerrit.wikimedia.org/r/350766 (https://phabricator.wikimedia.org/T163196) (owner: Faidon Liambotis) [14:41:17] so subbu, confirm ok with "/srv/visualdiff/pngs$ rm -Rf *" [14:41:31] double confirm it :-) [14:41:43] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1097 (duration: 00m 43s) [14:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:52] jynus, yes, :) [14:41:55] it is true, there is 31 GB of data there [14:42:39] RECOVERY - Disk space on ruthenium is OK: DISK OK [14:43:52] !log Run pt-table-checksum on s7.huwiki - https://phabricator.wikimedia.org/T163190 [14:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:10] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1067 [mediawiki-config] - https://gerrit.wikimedia.org/r/353061 (https://phabricator.wikimedia.org/T130067) (owner: Marostegui) [14:44:14] (PS2) Marostegui: db-eqiad.php: Depool db1067 [mediawiki-config] - https://gerrit.wikimedia.org/r/353061 (https://phabricator.wikimedia.org/T130067) [14:45:17] (CR) Volans: [C: 1] "LGTM, noop on compiler: https://puppet-compiler.wmflabs.org/6372/" [puppet] - https://gerrit.wikimedia.org/r/350766 (https://phabricator.wikimedia.org/T163196) (owner: Faidon Liambotis) [14:46:57] subbu, it
should be working now, and it should rememver the partitioning after a reboot [14:47:02] *remember [14:48:40] gilles: restarted swift-proxy everywhere [14:48:50] godog: thanks [14:49:04] (CR) jenkins-bot: db-eqiad.php: Depool db1067 [mediawiki-config] - https://gerrit.wikimedia.org/r/353061 (https://phabricator.wikimedia.org/T130067) (owner: Marostegui) [14:49:30] subbu, https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=17&fullscreen&orgId=1&var-server=ruthenium&var-network=eth0&from=now-24h&to=now [14:49:39] I will restart puppet now [14:50:01] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1067 - T147166 T130067 (duration: 00m 43s) [14:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:10] (PS3) Volans: labs: remove the _eth0 suffix from ipaddress facts [puppet] - https://gerrit.wikimedia.org/r/350767 (https://phabricator.wikimedia.org/T163196) (owner: Faidon Liambotis) [14:50:11] T147166: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166 [14:50:11] T130067: Add wl_id to watchlist tables on production dbs - https://phabricator.wikimedia.org/T130067 [14:50:17] !log Stop replication at the same position on db1067 and db2016 - https://phabricator.wikimedia.org/T147166 https://phabricator.wikimedia.org/T130067 [14:50:23] jynus, thanks. [14:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:21] Operations, Parsoid: ruthenium is going to run out of space on /srv and stop working - https://phabricator.wikimedia.org/T164915#3251786 (jcrespo) Open>Resolved a:jcrespo I have mounted /dev/mapper/ruthenium--vg-tank on /srv/visualdiff/pngs type ext4 (rw,relatime,data=ordered), deleted old /s...
[14:52:24] Operations, Patch-For-Review: logrotate for ruthenium - https://phabricator.wikimedia.org/T161920#3251789 (jcrespo) [14:55:57] chasemp: yeah you were right. I 've ended up adding them next to all the other ones in wikitech [14:56:02] problem fixed [14:56:13] Operations, ops-eqiad, Analytics-Cluster, Analytics-Kanban, User-Elukey: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#3251818 (Cmjohnson) [14:57:20] !log reboot kafka1001 for kernel upgrades (kafka main-eqiad, eventbus eqiad) [14:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:32] akosiaris: cool thanks [15:02:09] Operations, ops-eqiad: Relocate db1056 to rack C3 - https://phabricator.wikimedia.org/T164944#3251847 (Cmjohnson) [15:02:51] Operations, ops-eqiad: Analytics1040 system board repair needed - https://phabricator.wikimedia.org/T164942#3251865 (Cmjohnson) [15:03:47] !log shutting down db1056 for pysical maintenance T164944 [15:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:55] T164944: Relocate db1056 to rack C3 - https://phabricator.wikimedia.org/T164944 [15:04:00] (CR) DatGuy: [C: 1] "Zppix, see https://wikitech.wikimedia.org/wiki/Deployments - They need to be scheduled there." [mediawiki-config] - https://gerrit.wikimedia.org/r/353059 (https://phabricator.wikimedia.org/T164888) (owner: Zppix) [15:04:41] DatGuy: i'm aware... [15:04:54] DatGuy: I was asking for a last min patch addition [15:05:12] it should still be available via adding it there and using joucebot refresh [15:05:23] (PS2) Jcrespo: mariadb-install_server: Allow temporary full reimage of db1056 [puppet] - https://gerrit.wikimedia.org/r/353063 [15:05:28] it was during swat...
look at the timestamp [15:05:36] (CR) Jcrespo: [C: 2] mariadb-install_server: Allow temporary full reimage of db1056 [puppet] - https://gerrit.wikimedia.org/r/353063 (owner: Jcrespo) [15:05:56] then maybe go on IRC, but I doubt that gerrit was monitored during the deploy [15:06:37] DatGuy: i asked on both, however the messages i sent were overlooked (or missed) [15:06:44] alright [15:06:47] just noting :) [15:08:09] akosiaris, chasemp, is someone actively working to fix the tools puppet issue? [15:08:30] Ah! And here come the recovery emails, just like that [15:08:33] (CR) Chad: "Needs namespaceDupes (at least dry run) being done on the two wikis, in case there's any conflicts." [mediawiki-config] - https://gerrit.wikimedia.org/r/353059 (https://phabricator.wikimedia.org/T164888) (owner: Zppix) [15:08:33] andrewbogott: I spot checked tools-worker-05 and it seemed fixed [15:09:21] andrewbogott: already fixed [15:09:28] yep, so I see, thanks [15:10:25] (CR) Volans: [C: 2] dnsrecursor: use ipaddress6, not ipaddress6_eth0 [puppet] - https://gerrit.wikimedia.org/r/350766 (https://phabricator.wikimedia.org/T163196) (owner: Faidon Liambotis) [15:10:31] (PS4) Volans: dnsrecursor: use ipaddress6, not ipaddress6_eth0 [puppet] - https://gerrit.wikimedia.org/r/350766 (https://phabricator.wikimedia.org/T163196) (owner: Faidon Liambotis) [15:10:38] (PS1) Filippo Giunchedi: rt-hacks: add maint-announce_add_to_gcal.js [software] - https://gerrit.wikimedia.org/r/353087 [15:13:35] (PS1) Ayounsi: Fix: [puppet] - https://gerrit.wikimedia.org/r/353088 (https://phabricator.wikimedia.org/T164911) [15:14:54] (CR) Chad: Fix: (1 comment) [puppet] - https://gerrit.wikimedia.org/r/353088 (https://phabricator.wikimedia.org/T164911) (owner: Ayounsi) [15:14:55] XioNoX: the first line of the commit needs to be a summary of the changes [15:15:05] haha [15:15:19] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL -
kubemaster_6443 - Could not depool server chlorine.eqiad.wmnet because of too many down! [15:15:29] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - kubemaster_6443 - Could not depool server chlorine.eqiad.wmnet because of too many down! [15:15:29] PROBLEM - PyBal backends health check on lvs1012 is CRITICAL: PYBAL CRITICAL - kubemaster_6443 - Could not depool server chlorine.eqiad.wmnet because of too many down! [15:15:30] PROBLEM - PyBal backends health check on lvs1009 is CRITICAL: PYBAL CRITICAL - kubemaster_6443 - Could not depool server chlorine.eqiad.wmnet because of too many down! [15:15:43] paravoid: can't do shorter than "fix" :) [15:15:43] Dereckson, I wonder why you moved T162845 from 'under discussion' to 'config'? [15:15:43] T162845: Creating the "interface editor" permission on Portuguese Wikivoyage - https://phabricator.wikimedia.org/T162845 [15:15:48] <_joe_> akosiaris: ^^ you I guess [15:15:51] XioNoX: https://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines [15:16:00] chlroine, that is that? [15:16:04] *what [15:16:06] paravoid: but yeah, will do next time, my bad [15:16:12] <_joe_> jynus: kube master [15:16:17] XioNoX: you can amend now :) [15:16:24] _joe_: yup [15:16:24] I see now that they have local 'crats [15:16:27] no worries [15:16:42] is it open for claiming? [15:17:10] (PS2) Ayounsi: Various LibreNMS improvements [puppet] - https://gerrit.wikimedia.org/r/353088 (https://phabricator.wikimedia.org/T164911) [15:17:15] better :) [15:17:25] thanks! [15:17:49] but but it doesn't use the imperative!
[15:17:53] :-) [15:18:33] !log uploaded HHVM 3.18.2 and HHVM extensions to apt.wikimedia.org/main (previously only in experimental) [15:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:52] (PS7) Elukey: Add bot filter to mysql consumer [puppet] - https://gerrit.wikimedia.org/r/352582 (https://phabricator.wikimedia.org/T67508) (owner: Fdans) [15:18:59] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [15:21:12] heh, a combination of labs instance churn and eventstreams maybe [15:21:43] (CR) Volans: [C: -1] "I don't know the general librenms configuration so I cannot comment on that part, but I think there are typos, see comments inline." (3 comments) [puppet] - https://gerrit.wikimedia.org/r/353088 (https://phabricator.wikimedia.org/T164911) (owner: Ayounsi) [15:22:19] (CR) Faidon Liambotis: [C: -1] Various LibreNMS improvements (8 comments) [puppet] - https://gerrit.wikimedia.org/r/353088 (https://phabricator.wikimedia.org/T164911) (owner: Ayounsi) [15:22:42] (CR) Fdans: [C: 1] Add bot filter to mysql consumer [puppet] - https://gerrit.wikimedia.org/r/352582 (https://phabricator.wikimedia.org/T67508) (owner: Fdans) [15:22:52] godog: eventstreams ? [15:22:55] strange, dbquery errors have been reduced a lot in the last 30 minutes [15:22:56] (I just rebooted kafka1001) [15:22:59] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [15:23:16] (PS1) Alexandros Kosiaris: Revert "Fix user/group ownership for kubernetes certs" [puppet] - https://gerrit.wikimedia.org/r/353092 [15:23:29] something is not right, either on infra or on monitoring (logs) [15:23:45] (PS1) Andrew Bogott: Horizon: add novaadmin ldap creds to the horizon config.
[puppet] - https://gerrit.wikimedia.org/r/353093 (https://phabricator.wikimedia.org/T162097) [15:23:47] (CR) Alexandros Kosiaris: [C: 2] Revert "Fix user/group ownership for kubernetes certs" [puppet] - https://gerrit.wikimedia.org/r/353092 (owner: Alexandros Kosiaris) [15:23:51] (CR) Alexandros Kosiaris: [V: 2 C: 2] Revert "Fix user/group ownership for kubernetes certs" [puppet] - https://gerrit.wikimedia.org/r/353092 (owner: Alexandros Kosiaris) [15:24:00] (CR) Faidon Liambotis: [C: -1] Various LibreNMS improvements (1 comment) [puppet] - https://gerrit.wikimedia.org/r/353088 (https://phabricator.wikimedia.org/T164911) (owner: Ayounsi) [15:24:02] something deployed around 14:45? [15:24:29] depool of db1067 [15:25:01] (Draft1) Paladox: deployment_server: Labs does not support ipv6 so we need to allow ipv6 to be an undef [puppet] - https://gerrit.wikimedia.org/r/353091 [15:25:10] (PS2) Andrew Bogott: Horizon: add novaadmin ldap creds to the horizon config. [puppet] - https://gerrit.wikimedia.org/r/353093 (https://phabricator.wikimedia.org/T162097) [15:25:16] (PS2) Paladox: deployment_server: Labs does not support ipv6 so we need to allow ipv6 to be an undef [puppet] - https://gerrit.wikimedia.org/r/353091 [15:25:21] andrewbogott: two spaces there [15:25:24] (PS3) Paladox: deployment_server: Labs does not support ipv6 so we need to allow ipv6 to be an undef [puppet] - https://gerrit.wikimedia.org/r/353091 [15:25:59] elukey: yeah eventstreams creates new metrics but I don't know under what conditions [15:26:17] paravoid: after my : you mean?
[15:26:21] andrewbogott: yes [15:26:29] andrewbogott: also a trailing dot :) [15:26:29] RECOVERY - PyBal backends health check on lvs1009 is OK: PYBAL OK - All pools are healthy [15:26:35] * andrewbogott learned to type on an actual, mechanical typewriter [15:26:48] haha [15:26:50] (PS3) Andrew Bogott: Horizon: add novaadmin ldap creds to the horizon config [puppet] - https://gerrit.wikimedia.org/r/353093 (https://phabricator.wikimedia.org/T162097) [15:27:19] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [15:27:29] RECOVERY - PyBal backends health check on lvs1012 is OK: PYBAL OK - All pools are healthy [15:27:29] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [15:27:49] ??? [15:28:09] akosiaris: are those the kubemaster recoveries? [15:28:40] (PS4) Andrew Bogott: Horizon: add novaadmin ldap creds to the horizon config [puppet] - https://gerrit.wikimedia.org/r/353093 (https://phabricator.wikimedia.org/T162097) [15:28:49] and I guess is on all LVSes because it will be able to host services in all LVS groups... :) [15:29:01] (PS3) Ayounsi: Various LibreNMS improvements [puppet] - https://gerrit.wikimedia.org/r/353088 (https://phabricator.wikimedia.org/T164911) [15:29:02] marostegui, did you stop replication on a pooled slave? [15:29:32] because it is causing 150000 errors per minute [15:29:59] (CR) jerkins-bot: [V: -1] Various LibreNMS improvements [puppet] - https://gerrit.wikimedia.org/r/353088 (https://phabricator.wikimedia.org/T164911) (owner: Ayounsi) [15:30:05] jynus: what? [15:30:13] https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php [15:30:23] volans: not sure what to answer... yes ? [15:30:38] akosiaris: the recoveries or the guess?
:D [15:30:39] I merged it :| [15:30:48] the recoveries [15:30:52] I mean, it gets depooled automatically, but it is creating lots of log traffic [15:31:10] but no for this "(06:28:49 PM) volans: and I guess is on all LVSes because it will be able to host services in all LVS groups... :)" [15:31:18] jynus: I didn't do the rebase [15:31:20] pushing now [15:31:20] those lvs hosts are the low-traffic group [15:31:21] Operations, Performance-Team: Some Core availability Catchpoint tests might be more expensive than they need to be - https://phabricator.wikimedia.org/T162857#3251961 (MoritzMuehlenhoff) p:Triage>Normal [15:31:28] grr [15:31:28] XioNoX: puppet-lint is your friend, helps to catch those V-1 locally [15:31:41] and that's where the HTTP REST API for kubernetes is for now [15:31:45] ok [15:31:53] funnily enough, services will probably be on the same hosts [15:32:54] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1067 - T147166 T130067 (duration: 01m 43s) [15:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:03] T147166: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166 [15:33:04] T130067: Add wl_id to watchlist tables on production dbs - https://phabricator.wikimedia.org/T130067 [15:33:15] who decided that add_ip6_mapped belongs into modules/profile and not site.pp? :) [15:33:37] paravoid: yeah I have the linter integrated in atom, but missed that one [15:36:12] (CR) Andrew Bogott: [C: 2] Horizon: add novaadmin ldap creds to the horizon config [puppet] - https://gerrit.wikimedia.org/r/353093 (https://phabricator.wikimedia.org/T162097) (owner: Andrew Bogott) [15:36:14] (PS4) Paladox: deployment_server: Labs does not support ipv6 so we need to disable ipv6 in labs [puppet] - https://gerrit.wikimedia.org/r/353091 [15:36:54] (CR) Faidon Liambotis: [C: -2] "No, that shouldn't be done with Hiera.
I'll fix this in a different way for this and other hosts." [puppet] - https://gerrit.wikimedia.org/r/353091 (owner: Paladox) [15:37:11] (CR) jerkins-bot: [V: -1] deployment_server: Labs does not support ipv6 so we need to disable ipv6 in labs [puppet] - https://gerrit.wikimedia.org/r/353091 (owner: Paladox) [15:37:15] (CR) Paladox: "ok thanks." [puppet] - https://gerrit.wikimedia.org/r/353091 (owner: Paladox) [15:37:20] (Abandoned) Paladox: deployment_server: Labs does not support ipv6 so we need to disable ipv6 in labs [puppet] - https://gerrit.wikimedia.org/r/353091 (owner: Paladox) [15:38:50] strange in labs it cannot find this profile::mediawiki::deployment::server class but it is offered through horizion interface. [15:40:26] actually scratch that ^^ it is only shown if i add profile::mediawiki::deployment::server in the other class section. Otherwise i cannot find it on the puppet tab. [15:40:50] ah found it [15:43:03] (PS4) Ayounsi: Various LibreNMS improvements [puppet] - https://gerrit.wikimedia.org/r/353088 (https://phabricator.wikimedia.org/T164911) [15:44:11] (Draft1) Paladox: deployment_server: Fix misspelt variable [puppet] - https://gerrit.wikimedia.org/r/353094 [15:44:14] (PS2) Paladox: deployment_server: Fix misspelt variable [puppet] - https://gerrit.wikimedia.org/r/353094 [15:44:40] Operations, Patch-For-Review, User-fgiunchedi: Delete non-used and/or non-requested thumbnail sizes periodically - https://phabricator.wikimedia.org/T162796#3252050 (fgiunchedi) @gilles I've added two more sheets for hit and miss+pass from webrequest to the spreadsheet, looks like in April there were...
[15:44:49] !log instaling git security updates on jessie systems [15:44:56] Operations, Traffic: Explicitly limit varnishd transient storage - https://phabricator.wikimedia.org/T164768#3252052 (ema) Note that the limit cannot be set using a configuration parameter but rather by defining a storage backend named "Transient". For example: `-s Transient=malloc,1G`. See https://www... [15:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:14] (PS4) Faidon Liambotis: labs: remove the _eth0 suffix from ipaddress facts [puppet] - https://gerrit.wikimedia.org/r/350767 (https://phabricator.wikimedia.org/T163196) [15:45:16] (PS5) Faidon Liambotis: Switch add_ip6_mapped to use interface_primary [puppet] - https://gerrit.wikimedia.org/r/345568 (https://phabricator.wikimedia.org/T163196) [15:45:18] (PS3) Faidon Liambotis: Remove c/p interface argument to add_ip6_mapped [puppet] - https://gerrit.wikimedia.org/r/350768 (https://phabricator.wikimedia.org/T163196) [15:45:20] (PS3) Faidon Liambotis: lvs: remove support for <= trusty [puppet] - https://gerrit.wikimedia.org/r/350769 [15:45:22] (PS5) Faidon Liambotis: interface/lvs: add an $interface parameter, remove hardcoded eth0 [puppet] - https://gerrit.wikimedia.org/r/350770 (https://phabricator.wikimedia.org/T163196) [15:45:24] (PS5) Faidon Liambotis: cache: use interface_primary instead of eth0 [puppet] - https://gerrit.wikimedia.org/r/350771 (https://phabricator.wikimedia.org/T163196) [15:45:26] (PS1) Faidon Liambotis: Move all add_ip6_mapped calls to site.pp [puppet] - https://gerrit.wikimedia.org/r/353095 [15:45:38] paladox: https://gerrit.wikimedia.org/r/353095 [15:45:50] thanks :) [15:45:59] _joe_: fyi, https://gerrit.wikimedia.org/r/353095 may be of interest to you -- I'm sure you won't like it much but I think it's the right way here [15:46:09] or well, I'm not sure, but I suspect [15:47:05] (CR) Volans: [C: 1] "LGTM, puppet
compiler is noop on lab* and silver hosts:" [puppet] - https://gerrit.wikimedia.org/r/350767 (https://phabricator.wikimedia.org/T163196) (owner: Faidon Liambotis) [15:47:24] paravoid: do you know if we can test with the compiler deployment-prep stuff? [15:47:31] no idea [15:47:58] _joe_ maybe? ^^^ [15:48:37] <_joe_> paravoid: actually in the role/profile model, it should go in roles [15:48:43] <_joe_> I think I even commented [15:48:50] <_joe_> in the puppet coding structure [15:49:04] why? [15:49:07] <_joe_> if you have 1:1 role => node definition [15:50:25] sounds pretty gross to me :) [15:50:37] <_joe_> what sounds gross? [15:52:14] well roles are supposed to be 1:N, no? [15:52:21] <_joe_> what do you mean? [15:52:23] same role can be applied to multiple nodes [15:52:28] not 1:1 [15:52:28] <_joe_> yes [15:52:37] <_joe_> no, but 1 node definition => 1 role [15:52:42] yes, sure [15:53:04] IPv6 configuration (or not) is a host property, not a role property I'd say [15:53:19] case in point: we want to be able to apply the same role to other hosts that have no IPv6 connectivity to begin with (Labs) [15:53:51] Operations, ops-eqiad, Patch-For-Review: rack and cable frlog1001 - https://phabricator.wikimedia.org/T163127#3252118 (Jgreen) Open>Resolved a:Jgreen host is working fine, and the interface flap issue appears to have stopped [15:53:53] I don't think this is better done with hiera or realm.pp or other conditionals, I think we should just be doing this per host for now [15:54:06] adding add_ip6_mapped needs changes in DNS as well [15:56:48] <_joe_> yeah I'm honestly neutral at this point, but the logic would be "if you need to add something to a node outside of a role, create a derived node" [15:57:00] <_joe_> this might be the only case where I'm ok either way [15:57:03] derived node?
[15:57:09] <_joe_> sorry, derived role [15:57:12] ah [15:57:24] yeah in the general case I think I agree with you [15:57:43] in this case I really don't want us to start having role(labs_deployment_server) and such [15:57:59] (CR) Paladox: [C: 1] "Tested all this and works :)" [puppet] - https://gerrit.wikimedia.org/r/353095 (owner: Faidon Liambotis) [15:58:08] paladox: thanks :) [15:58:15] Your welcome :) [15:58:18] i've tested all your changes [15:58:33] and works [15:59:09] (CR) Paladox: [C: 1] "Tested this :)" [puppet] - https://gerrit.wikimedia.org/r/350767 (https://phabricator.wikimedia.org/T163196) (owner: Faidon Liambotis) [15:59:19] (CR) Filippo Giunchedi: [C: -1] "Looks good overall, a mixture of comments including nits" (15 comments) [puppet] - https://gerrit.wikimedia.org/r/353064 (https://phabricator.wikimedia.org/T151971) (owner: Ayounsi) [16:01:49] Operations, Page-Previews, Performance-Team, Reading-Web-Backlog, and 3 others: Performance review #2 of Hovercards (Popups extension) - https://phabricator.wikimedia.org/T70861#732861 (Tbayer) Hi, I have been trying to get up to speed on this before our meeting today. Reading through the discu... [16:08:26] PROBLEM - DPKG on ms-be1039 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:09:26] RECOVERY - DPKG on ms-be1039 is OK: All packages OK [16:11:38] Operations, ops-eqiad, DC-Ops, Labs: Move labstore1002 and labstore1002-array1 and labstore1002-array2 to different rack (currently in C3) - https://phabricator.wikimedia.org/T158913#3252214 (Cmjohnson) @madhuvishy @chasemp Moved the labstores back to row C...labstore1001 is in c2 and labstore100...
[16:12:34] Operations, ops-eqiad: Relocate db1056 to rack C3 - https://phabricator.wikimedia.org/T164944#3252236 (Cmjohnson) Open>Resolved Moved the server to c3....accessible via ssh --resolving [16:13:38] Operations, ops-eqiad, DC-Ops, Labs: Move labstore1002 and labstore1002-array1 and labstore1002-array2 to different rack (currently in C3) - https://phabricator.wikimedia.org/T158913#3252239 (madhuvishy) Woot thanks @Cmjohnson :D [16:13:40] PROBLEM - Host analytics1040 is DOWN: PING CRITICAL - Packet loss = 100% [16:14:00] PROBLEM - MariaDB Slave Lag: s1 on db2016 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 5016.49 seconds [16:14:00] PROBLEM - MariaDB Slave Lag: s1 on db2070 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 5016.54 seconds [16:14:10] PROBLEM - MariaDB Slave Lag: s1 on db2034 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 5026.26 seconds [16:14:10] PROBLEM - MariaDB Slave Lag: s1 on db2048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 5026.29 seconds [16:14:11] PROBLEM - MariaDB Slave Lag: s1 on db2069 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 5026.67 seconds [16:14:13] PROBLEM - MariaDB Slave Lag: s1 on db2042 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 5028.98 seconds [16:14:13] PROBLEM - MariaDB Slave Lag: s1 on db2071 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 5029.03 seconds [16:14:13] PROBLEM - MariaDB Slave Lag: s1 on db2062 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 5030.12 seconds [16:14:35] PROBLEM - MariaDB Slave Lag: s1 on db1067 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 5045.65 seconds [16:14:36] PROBLEM - MariaDB Slave Lag: s1 on db2055 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 5046.61 seconds [16:14:52] lost downtimes? [16:15:07] <_joe_> and he was definitely/win 17 [16:15:10] <_joe_> argh [16:15:39] sounds like lost downtime, analytics1040 e.g.
is down with a broken mainboard [16:15:47] Operations, Deployment-Systems, Patch-For-Review, Scap (Scap3-MediaWiki-MVP), User-Joe: Install conftool on deployment masters - https://phabricator.wikimedia.org/T163565#3252249 (demon) Oh, duh other total obvious usecase I forgot: being able to pull our target list from etcd directly, inste... [16:16:12] PROBLEM - Check systemd state on labstore2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:16:14] PROBLEM - puppet last run on kubernetes1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Package[cni] [16:16:31] PROBLEM - Check systemd state on elastic2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:16:34] ACKNOWLEDGEMENT - Host analytics1040 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T164942 [16:16:36] moritzm: yep, lost downtimes, because they were downtimed earlier for 24h, they have been stopped for 2 hours now [16:17:10] Operations, Icinga, Monitoring: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3252259 (jcrespo) I think this is an almost an unbreak now, if this keeps happening. [16:17:12] PROBLEM - puppet last run on kubernetes1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 21 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cni] [16:17:31] PROBLEM - Elasticsearch HTTPS on elastic2020 is CRITICAL: SSL CRITICAL - failed to verify search.svc.codfw.wmnet against elastic2020.codfw.wmnet [16:17:41] PROBLEM - puppet last run on ms-be3003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures.
Failed resources (up to 3 shown): Exec[parted-/dev/sdl] [16:18:00] PROBLEM - MD RAID on elastic2020 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 2, Spare: 0 [16:18:01] ACKNOWLEDGEMENT - MD RAID on elastic2020 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 2, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T164953 [16:18:01] PROBLEM - puppet last run on kubernetes1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 14 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cni] [16:18:05] Operations, ops-codfw: Degraded RAID on elastic2020 - https://phabricator.wikimedia.org/T164953#3252261 (ops-monitoring-bot) [16:18:30] PROBLEM - mediawiki-installation DSH group on mw2256 is CRITICAL: Host mw2256 is not in mediawiki-installation dsh group [16:18:50] PROBLEM - puppet last run on kubernetes1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Package[cni] [16:19:20] PROBLEM - Router interfaces on mr1-codfw is CRITICAL: CRITICAL: host 208.80.153.196, interfaces up: 35, down: 1, dormant: 0, excluded: 0, unused: 0; ge-0/0/2: down - Cust: Airport Express WiFi AP [16:19:29] (PS1) Papaul: DHCP/partman: Add dhcp and partman entries for kubernetes200[1-4] [puppet] - https://gerrit.wikimedia.org/r/353098 [16:20:11] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 146 connecting: cp3003_v4, cp3003_v6, cp3009_v4, cp3009_v6 [16:21:00] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 146 connecting: cp3003_v4, cp3003_v6, cp3009_v4, cp3009_v6 [16:21:32] (PS1) Amire80: Set collation for Bashkir wikis to uppercase-ba [mediawiki-config] - https://gerrit.wikimedia.org/r/353099 (https://phabricator.wikimedia.org/T162823) [16:21:40] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 146 connecting: cp3003_v4, cp3003_v6, cp3009_v4, cp3009_v6
[16:21:40] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 146 connecting: cp3003_v4, cp3003_v6, cp3009_v4, cp3009_v6 [16:22:30] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 146 connecting: cp3003_v4, cp3003_v6, cp3009_v4, cp3009_v6 [16:23:10] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 146 connecting: cp3003_v4, cp3003_v6, cp3009_v4, cp3009_v6 [16:24:20] PROBLEM - MegaRAID on heze is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [16:24:23] ACKNOWLEDGEMENT - MegaRAID on heze is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T164955 [16:24:29] Operations, ops-codfw: Degraded RAID on heze - https://phabricator.wikimedia.org/T164955#3252287 (ops-monitoring-bot) [16:24:40] PROBLEM - DPKG on maerlant is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:25:40] RECOVERY - DPKG on maerlant is OK: All packages OK [16:25:41] Operations, DC-Ops, Labs: Move labstore1002 and labstore1002-array1 and labstore1002-array2 to different rack (currently in C3) - https://phabricator.wikimedia.org/T158913#3252292 (Cmjohnson) [16:25:49] Operations, Labs: Move labstore1002 and labstore1002-array1 and labstore1002-array2 to different rack (currently in C3) - https://phabricator.wikimedia.org/T158913#3051441 (Cmjohnson) [16:27:08] Operations, ops-eqiad, DBA: Decommission db1024 - https://phabricator.wikimedia.org/T164702#3252301 (Cmjohnson) I am ready whenever you are to decom and remove the rack [16:27:40] RECOVERY - MariaDB Slave Lag: s1 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89988.22 seconds [16:29:30] PROBLEM - DPKG on nescio is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:30:14] !log roll-restart swift object servers to apply https://gerrit.wikimedia.org/r/#/c/353078 [16:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:30] RECOVERY - DPKG on nescio is
OK: All packages OK [16:32:49] (PS1) DCausse: [cirrus] remove elastic quirks after elastic 5.3 upgrade [mediawiki-config] - https://gerrit.wikimedia.org/r/353100 [16:37:10] PROBLEM - HP RAID on ms-be1035 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [16:37:36] (CR) Dzahn: "yep, sorry, i don't know what i was thinking when i merged that since i had even done the same thing before and abandoned it" [puppet] - https://gerrit.wikimedia.org/r/353013 (owner: Faidon Liambotis) [16:39:10] Operations, Operations-Software-Development, Technical-Debt: Sunset our use of Salt - https://phabricator.wikimedia.org/T164780#3252329 (jcrespo) [16:44:42] Operations, Parsoid: ruthenium is going to run out of space on /srv and stop working - https://phabricator.wikimedia.org/T164915#3252340 (Dzahn) Thanks for this @jcrespo I mounted that on March 31st as a stop gap because it was running out of disk (didn't have logrotate afair T161920) and should have edi... [16:46:42] (PS1) Krinkle: webperf: Remove remnants of webperf::asset_check [puppet] - https://gerrit.wikimedia.org/r/353104 (https://phabricator.wikimedia.org/T164419) [16:46:46] (CR) Krinkle: "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/352302 (https://phabricator.wikimedia.org/T164419) (owner: Krinkle) [16:47:10] RECOVERY - HP RAID on ms-be1035 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [16:47:28] you know by this time that temporary solutions are not temporary :-9 [16:47:54] ^mutante :-) [16:48:43] (PS1) Gehel: elasticsearch - silence parse field deprecation logs [puppet] - https://gerrit.wikimedia.org/r/353105 [16:48:45] well in the grand scheme isn't all human existence temporary?
;) [16:49:12] no, I plan to upload my brain to the internets [16:49:28] you first [16:50:28] I'll save my sophomoric philosophical rant on that topic for another time [16:51:14] * bd808 has read a lot of Rudy Rucker books which obviously makes him an expert [16:51:30] RECOVERY - Check systemd state on elastic2020 is OK: OK - running: The system is fully operational [16:51:48] (03CR) 10DCausse: [C: 031] elasticsearch - silence parse field deprecation logs [puppet] - 10https://gerrit.wikimedia.org/r/353105 (owner: 10Gehel) [16:52:22] (03PS1) 10Dereckson: Revert "Create Autor and Portal namespaces on Spanish Wikisource" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353107 (https://phabricator.wikimedia.org/T164195) [16:53:46] bd808: twentyafterfour: you're currently busy on Tin? [16:54:02] Dereckson: not yet [16:54:16] Dereckson: I think I just have a shell there in a tmux session. not changing anything [16:54:28] * twentyafterfour logs out [16:54:31] twentyafterfour: can I have it five minutes for an emergency UBN deploy? [16:54:34] ok [16:54:38] go for it [16:54:41] (03CR) 10Dereckson: [C: 032] "Emergency deployment, lot of pages broken." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353107 (https://phabricator.wikimedia.org/T164195) (owner: 10Dereckson) [16:55:26] Dereckson: will that work even though antoine ran the namespace dedup script? [16:55:29] dedupe [16:55:32] (03PS1) 10Smalyshev: Enable archive search on select wikis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/353108 [16:55:54] 06Operations, 10ops-codfw: Degraded RAID on elastic2020 - https://phabricator.wikimedia.org/T164953#3252383 (10Volans) [16:55:55] 06Operations, 10ops-codfw: Degraded RAID on elastic2020 - https://phabricator.wikimedia.org/T164841#3252385 (10Volans) [16:56:12] greg-g: yes, the ns will be restored *but* all the pages will be in main namespace, under "Autor:" name, we'll have to rename them [16:56:18] (03Merged) 10jenkins-bot: Revert "Create Autor and Portal namespaces on Spanish Wikisource" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353107 (https://phabricator.wikimedia.org/T164195) (owner: 10Dereckson) [16:56:24] that probably renamed a few pages from :Author:blah to <104>:blah [16:56:27] (03CR) 10jenkins-bot: Revert "Create Autor and Portal namespaces on Spanish Wikisource" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353107 (https://phabricator.wikimedia.org/T164195) (owner: 10Dereckson) [16:56:33] Dereckson: gotcha, thanks for helping [16:56:34] 06Operations, 10ops-codfw: Degraded RAID on heze - https://phabricator.wikimedia.org/T163087#3252388 (10Volans) [16:56:36] 06Operations, 10ops-codfw: Degraded RAID on heze - https://phabricator.wikimedia.org/T164955#3252390 (10Volans) [16:59:12] (03PS3) 10Ayounsi: Add new logstash LVS service [puppet] - 10https://gerrit.wikimedia.org/r/353064 (https://phabricator.wikimedia.org/T151971) [17:00:03] (03PS2) 10Nschaaf: Add QuickSurvey for reader segmentation research [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353053 (https://phabricator.wikimedia.org/T131949) [17:00:21] 06Operations, 10ops-codfw: mw2098 failed to come up after reboot - https://phabricator.wikimedia.org/T164959#3252410 (10MoritzMuehlenhoff) [17:00:40] (03PS3) 10Nschaaf: Add QuickSurvey for reader segmentation research [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353053 (https://phabricator.wikimedia.org/T131949) [17:01:02] ACKNOWLEDGEMENT - Host 
mw2098 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T164959 [17:03:47] 06Operations, 13Patch-For-Review, 15User-fgiunchedi: Delete non-used and/or non-requested thumbnail sizes periodically - https://phabricator.wikimedia.org/T162796#3252428 (10Gilles) 400m misses means 154 requests per second. It would at least triple the load on Thumbor. Might be possible if/once we've repurp... [17:03:47] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Revert "Create Autor and Portal namespaces on Spanish Wikisource" (T164195) (duration: 00m 43s) [17:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:18] Here we are [17:05:19] https://es.wikisource.org/wiki/%C3%8Dndice:Nueva_arte_de_cocina.djvu [17:05:37] paladox: regarding https://gerrit.wikimedia.org/r/#/c/350767 where did you test it? [17:05:49] labs [17:05:51] phab-tin [17:06:04] but not deployment-prep, right? [17:06:13] yep [17:06:17] ok, thanks [17:06:25] you're welcome :) [17:11:25] (03CR) 10Andrew Bogott: [C: 031] "This is fine to merge as long as the puppet compiler confirms. I can test it later on if you don't get there first." [puppet] - 10https://gerrit.wikimedia.org/r/352660 (owner: 10Dzahn) [17:11:50] (03CR) 10Andrew Bogott: [C: 031] "This is fine to merge as long as the puppet compiler confirms. I can test it later on if you don't get there first." [puppet] - 10https://gerrit.wikimedia.org/r/352636 (owner: 10Dzahn) [17:12:12] (03PS1) 10Andrew Bogott: Californium: include ldap client tools [puppet] - 10https://gerrit.wikimedia.org/r/353110 (https://phabricator.wikimedia.org/T162097) [17:15:10] (03CR) 10Andrew Bogott: [C: 032] Californium: include ldap client tools [puppet] - 10https://gerrit.wikimedia.org/r/353110 (https://phabricator.wikimedia.org/T162097) (owner: 10Andrew Bogott) [17:17:12] bblack, gehel - want to discuss headers? [17:17:38] MaxSem: I'm not gonna have time today, sorry!
[17:18:34] MaxSem: in general I think we should start from sort of documenting what we really want the behavior to be (in terms of both client and varnish caching) and then work backwards from there as to what headers the app should set and what VCL should do [17:19:10] MaxSem: right now maps VCL lacks any kind of Cache-Control header hacking/suppression on the way to the client (unlike e.g. cache_text) [17:19:19] (sorry for butting in) but im willing to help write documentation [17:19:31] MaxSem: so the CC header you're sending now, is also wide open to client interpretation too [17:19:55] I'm not sending it yet :) [17:20:07] (but in any case, I'm literally just trying to get my fingers off the keyboard and run out the door) [17:20:18] do it then [17:20:31] we'll talk later=) [17:20:56] !log installing groovy security updates [17:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:03] MaxSem: I guess that answers the question ... [17:24:29] MaxSem: I'm going to run get something to eat before our next meeting... [17:24:30] Dereckson: feel free to ping RainbowSprinkles for help, btw [17:29:21] (03PS1) 10RobH: decommission mira [puppet] - 10https://gerrit.wikimedia.org/r/353116 [17:29:45] (03PS2) 10RobH: decommission mira [puppet] - 10https://gerrit.wikimedia.org/r/353116 [17:29:47] (03CR) 10Mobrovac: "PCC - https://puppet-compiler.wmflabs.org/6377/ . Plenty of changes for the dev cluster, but they all seem benign." [puppet] - 10https://gerrit.wikimedia.org/r/352851 (owner: 10Giuseppe Lavagetto) [17:31:05] 06Operations, 10Page-Previews, 06Performance-Team, 06Reading-Web-Backlog, and 3 others: Performance review #2 of Hovercards (Popups extension) - https://phabricator.wikimedia.org/T70861#3252563 (10Tbayer) >>! In T70861#3206917, @Gilles wrote: > Link interaction seems like a viable candidate in the quanti... 
[17:32:54] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: decom mira - https://phabricator.wikimedia.org/T164588#3252579 (10RobH) [17:33:42] (03PS1) 10Dzahn: site.pp: consistent quoting for role names [puppet] - 10https://gerrit.wikimedia.org/r/353117 [17:33:44] (03PS1) 10Dzahn: parsoid: move 'include standard' to role [puppet] - 10https://gerrit.wikimedia.org/r/353118 [17:33:46] (03PS1) 10Dzahn: dnsrecursor: move 'include standard' to role [puppet] - 10https://gerrit.wikimedia.org/r/353119 [17:33:48] (03PS1) 10Dzahn: thumbor: move 'include standard' to role [puppet] - 10https://gerrit.wikimedia.org/r/353120 [17:33:50] (03PS1) 10Dzahn: poolcounter: move 'include standard' to role [puppet] - 10https://gerrit.wikimedia.org/r/353121 [17:33:52] (03PS1) 10Dzahn: puppetmaster::backend: move 'include standard' to role [puppet] - 10https://gerrit.wikimedia.org/r/353122 [17:33:54] (03PS1) 10Dzahn: authdns::server: move 'include standard' to role [puppet] - 10https://gerrit.wikimedia.org/r/353123 [17:33:56] (03PS1) 10Dzahn: syslog::centralserver: move 'include standard' to role [puppet] - 10https://gerrit.wikimedia.org/r/353124 [17:33:58] (03PS1) 10Dzahn: ganeti: move 'include standard' to role [puppet] - 10https://gerrit.wikimedia.org/r/353125 [17:36:55] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: decom mira - https://phabricator.wikimedia.org/T164588#3252602 (10RobH) [17:37:21] greg-g: ok [17:37:55] (03CR) 10Dzahn: "this looks good to me, just the change in mariadb grants means that has to also be deployed by DBA afaict. 
if it wouldn't have that i woul" [puppet] - 10https://gerrit.wikimedia.org/r/353116 (owner: 10RobH) [17:38:55] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 15User-Dereckson: Create /community-beacon alternative entry point - https://phabricator.wikimedia.org/T155929#3252605 (10AndyRussG) [17:41:00] (03PS4) 10Aaron Schulz: Enable $wgEnableWANCacheReaper for testwiki and test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339245 [17:42:42] (03PS3) 10RobH: decommission mira [puppet] - 10https://gerrit.wikimedia.org/r/353116 [17:44:36] (03PS4) 10RobH: decommission mira [puppet] - 10https://gerrit.wikimedia.org/r/353116 [17:46:08] (03CR) 10RobH: [C: 032] "After discussion with Jaime via IRC, the wikitech grant file edit is fine to keep it in (there was confusion if it would be ok, hence its " [puppet] - 10https://gerrit.wikimedia.org/r/353116 (owner: 10RobH) [17:46:17] (03PS5) 10RobH: decommission mira [puppet] - 10https://gerrit.wikimedia.org/r/353116 [17:46:38] !log setting db1056's cpu scaling_governor to performance, rather than powersave [17:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:12] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: decom mira - https://phabricator.wikimedia.org/T164588#3252658 (10RobH) [17:50:37] (03PS4) 10Nschaaf: Add QuickSurvey for reader segmentation research [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353053 (https://phabricator.wikimedia.org/T131949) [17:50:43] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: decom mira - https://phabricator.wikimedia.org/T164588#3238927 (10RobH) [17:54:01] (03PS1) 10RobH: decom mira production dns entries [dns] - 10https://gerrit.wikimedia.org/r/353131 [17:54:49] (03CR) 10RobH: [C: 032] decom mira production dns entries [dns] - 10https://gerrit.wikimedia.org/r/353131 (owner: 10RobH) [17:56:11] PROBLEM - Apache HTTP on mw1262 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 
Service Unavailable - 1308 bytes in 0.080 second response time [17:56:34] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: decom mira - https://phabricator.wikimedia.org/T164588#3252702 (10RobH) [17:56:47] RECOVERY - MariaDB Slave Lag: s1 on db1067 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [17:57:21] RECOVERY - Apache HTTP on mw1262 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.366 second response time [17:57:59] 06Operations, 10DBA: remove mira wikitech grants - https://phabricator.wikimedia.org/T164968#3252704 (10RobH) [17:58:32] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: decom mira - https://phabricator.wikimedia.org/T164588#3238927 (10RobH) [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170510T1800). Please do the needful. [18:00:06] Smalyshev and Krinkle: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:25] bah! I have some things to add! [18:00:39] addshore: add them, there are still free slots. [18:00:44] I can SWAT ths evening. 
[18:00:54] silly timezones, that totally snuck up on me [18:02:04] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: decom mira - https://phabricator.wikimedia.org/T164588#3252729 (10RobH) a:05RobH>03Papaul [18:02:39] (03CR) 10Addshore: Put Cognate in write mode for all wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352569 (https://phabricator.wikimedia.org/T164407) (owner: 10Addshore) [18:02:40] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: decom mira - https://phabricator.wikimedia.org/T164588#3238927 (10RobH) @Papaul: Please go ahead and wipe the disks on this system, and then pull it from the rack entirely for decommission, updating racktables and then assigning back to m... [18:02:43] (03PS2) 10Addshore: Put Cognate in write mode for all wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352569 (https://phabricator.wikimedia.org/T164407) [18:03:24] Dereckson: my 3 are added [18:03:33] Krinkle: do we currently use the CodeReview extension extensively? [18:03:36] Yes [18:03:40] Oh, extensively [18:03:46] on mediawiki.org for the old commits? [18:03:50] We need to find a migration patch for that data. [18:03:55] o/ [18:04:10] okay, let's swat it so [18:04:27] Dereckson: I suppose not *that* much, but aside from 'us', 'people' do view the Special:Code pages and they spam logs, so this fixes that. [18:04:46] sounds reasonnable enough [18:04:49] here [18:05:37] addshore: ack'ed [18:06:28] addshore: what bug https://gerrit.wikimedia.org/r/#/c/353090/ fix? [18:06:31] PROBLEM - Host mr1-ulsfo.oob is DOWN: PING CRITICAL - Packet loss = 100% [18:06:51] PROBLEM - Host mr1-ulsfo is DOWN: CRITICAL - Network Unreachable (198.35.26.194) [18:06:53] Dereckson: missing dep, ahh, I believe someone has filed a ticket, let me find it [18:07:01] PROBLEM - Host asw-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [18:07:23] robh: anything ongoing in ulsfo? 
^^^ [18:07:29] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352569 (https://phabricator.wikimedia.org/T164407) (owner: 10Addshore) [18:07:46] Dereckson: hmmph, no ticket actually [18:08:45] (03PS2) 10Krinkle: mwgrep: If --title is set, don't also require '*.js/.css' [puppet] - 10https://gerrit.wikimedia.org/r/349351 [18:08:51] RECOVERY - MariaDB Slave Lag: s1 on db2071 is OK: OK slave_sql_lag Replication lag: 50.90 seconds [18:08:52] (03PS3) 10Krinkle: mwgrep: Add --etitle option [puppet] - 10https://gerrit.wikimedia.org/r/349352 [18:09:01] PROBLEM - Host mr1-ulsfo IPv6 is DOWN: CRITICAL - Destination Unreachable (2620:0:863:ffff::6) [18:09:01] RECOVERY - MariaDB Slave Lag: s1 on db2055 is OK: OK slave_sql_lag Replication lag: 0.27 seconds [18:09:12] RECOVERY - MariaDB Slave Lag: s1 on db2062 is OK: OK slave_sql_lag Replication lag: 0.05 seconds [18:09:12] (03Merged) 10jenkins-bot: Put Cognate in write mode for all wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352569 (https://phabricator.wikimedia.org/T164407) (owner: 10Addshore) [18:09:12] RECOVERY - MariaDB Slave Lag: s1 on db2016 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:09:12] RECOVERY - MariaDB Slave Lag: s1 on db2070 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:09:20] (03CR) 10jenkins-bot: Put Cognate in write mode for all wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352569 (https://phabricator.wikimedia.org/T164407) (owner: 10Addshore) [18:09:21] RECOVERY - MariaDB Slave Lag: s1 on db2048 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:09:21] RECOVERY - MariaDB Slave Lag: s1 on db2034 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:09:21] RECOVERY - MariaDB Slave Lag: s1 on db2069 is OK: OK slave_sql_lag Replication lag: 0.28 seconds [18:09:31] RECOVERY - MariaDB Slave Lag: s1 on db2042 is OK: OK slave_sql_lag Replication lag: 0.16 seconds [18:10:09] Dereckson: I forgot to 
add https://gerrit.wikimedia.org/r/#/c/350899/ (mw-config minor interwiki.php update) - can we add that one still (after the others) [18:10:14] * Dereckson nods [18:10:24] Thanks [18:11:42] addshore: Put Cognate in write mode for all wiktionaries live on mwdebug1002 [18:11:54] Dereckson: ack [18:12:18] Dereckson: thats looks good to sync [18:13:28] syncing [18:14:05] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Put Cognate in write mode for all wiktionaries (T164407) (duration: 00m 42s) [18:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:13] T164407: Cognate has been disabled from WMF because it caused an outage on x1 by overtaking 10000 concurrent connections - https://phabricator.wikimedia.org/T164407 [18:15:03] 06Operations, 06DC-Ops, 10netops: mr1-ulsfo crashed - https://phabricator.wikimedia.org/T164970#3252768 (10ayounsi) [18:15:38] ACKNOWLEDGEMENT - Host mr1-ulsfo.oob is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T164970 [18:15:38] ACKNOWLEDGEMENT - Host mr1-ulsfo IPv6 is DOWN: CRITICAL - Destination Unreachable (2620:0:863:ffff::6) Ayounsi https://phabricator.wikimedia.org/T164970 [18:15:38] ACKNOWLEDGEMENT - Host mr1-ulsfo is DOWN: CRITICAL - Network Unreachable (198.35.26.194) Ayounsi https://phabricator.wikimedia.org/T164970 [18:15:38] ACKNOWLEDGEMENT - Host asw-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T164970 [18:16:16] PROBLEM - Check whether ferm is active by checking the default input chain on tegmen is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [18:16:25] PROBLEM - Check systemd state on tegmen is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[18:16:35] 06Operations, 06DC-Ops, 10netops: mr1-ulsfo crashed - https://phabricator.wikimedia.org/T164970#3252782 (10RobH) I emailed support to reboot it via power cable removal: > Support, > > In remotely administering our mr1-ulsfo Juniper SRX100 device, it locked up and is unresponsive to our attempts to connec... [18:17:09] and all looks fine [18:17:44] Dereckson: would need https://gerrit.wikimedia.org/r/#/c/353106/ on terbium to test [18:19:43] SMalyshev: ack'ed [18:20:02] and I see it's merged [18:22:15] it seems tegmen systemd has a unit that failed [18:26:44] SMalyshev: live on terbium and on mwdebug1002.eqiad.wmnet [18:27:27] Dereckson: seems to be working fine [18:28:10] SMalyshev: okay syncinb [18:28:21] Dereckson: thanks! [18:28:40] Krinkle: check on Tin the state of /srv/mediawiki-staging/php-1.30.0-wmf.1/extensions/CodeReview and tell me if you're happy to consider this to our current HEAD [18:30:19] Dereckson: Not sure I follow. That directory on tin, doesn't include the commit. [18:30:31] Krinkle: indeed, it's the state before I can rebase to take yours [18:30:58] !log dereckson@tin Synchronized php-1.30.0-wmf.1/extensions/CirrusSearch/maintenance/forceSearchIndex.php: Fix index usage on archive indexing (duration: 00m 42s) [18:31:01] https://github.com/wikimedia/mediawiki-extensions-CodeReview/commits/wmf/1.30.0-wmf.1 [18:31:02] https://github.com/wikimedia/mediawiki-extensions-CodeReview/commits/master [18:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:07] There were no commits in between [18:31:15] It's in sync, and applies cleanly [18:31:31] Krinkle: perfect current state = origin/wmf/1.30.0-wmf.1 before your commit so [18:32:04] Dereckson: According to what? [18:32:21] There is no dirty submodule, no untracked files, nothing. 
Looks perfect in sync to me, git says so [18:32:35] * ffbc182 - (HEAD, origin/wmf/1.30.0-wmf.1, master) build: add jakub-onderka/php-console-highlighter (5 days ago) [18:33:23] Krinkle: ah your fix is for 21, not 1? [18:33:31] Dereckson: No, it's for wmf.1 [18:33:43] It was merged in master as the first commit since the branch was created. [18:33:50] It applies cleanly, plain cherry-pick [18:33:51] Krinkle: okay got the issue: https://gerrit.wikimedia.org/r/#/c/353101/ [18:34:00] you put it on wmf/1.29.0-wmf.21 [18:34:06] Did I? [18:34:07] Oops [18:34:09] Sorry [18:34:38] No problem, do a revert commit on 21, and a commit to cherry pick it again to 1 [18:34:39] Yeah, that was silly [18:34:42] Yep [18:37:01] Done :) [18:37:24] (03CR) 10Daniel Kinzler: "We do want to get rid of this, but we have to be careful when, exactly. Too early, and we may lose data. Too late, and the script may fai" [puppet] - 10https://gerrit.wikimedia.org/r/352574 (https://phabricator.wikimedia.org/T140890) (owner: 10Ladsgroup) [18:38:25] RECOVERY - Host mr1-ulsfo.oob is UP: PING OK - Packet loss = 0%, RTA = 57.63 ms [18:39:55] is the SWAT over? can I add a patch to it? [18:41:33] gilles: still going [18:42:00] gilles: you can, es [18:42:21] thanks! adding to the wiki as we speak [18:42:55] (03PS1) 10EBernhardson: Logstash match_mapping_type still uses string, not text [puppet] - 10https://gerrit.wikimedia.org/r/353150 (https://phabricator.wikimedia.org/T164823) [18:46:17] addshore: dd "oojs-ui" dep to ext.TwoColConflict.filterOptionsJs live *for wmf21 only* on mwdebug1002 [18:46:44] Urbanecm: ping? [18:47:13] wait? does +2ing on extensions generate commits to update the submodules on core automatically now? [18:47:16] PROBLEM - HP RAID on ms-be1035 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. 
[18:47:41] ack [18:48:06] gilles: yes [18:48:36] hah, ok :) so I guess all I need is a sync for those, I'll delete my manual patch that was updating the submodules [18:48:41] Since like forever ;-) [18:48:44] I mean there's one in jenkins now [18:48:58] gilles: so what we want is three commits on the relevant extensions for wmf/1.30.0-wmf.1 [18:49:02] Dereckson: looks good [18:49:06] that shows you how long it's been since I requested an extension change to be swatted [18:49:13] no need to check .1 as .21 and .1 have the same code :) [18:49:28] Dereckson: I'll link to those [18:50:26] ok [18:50:44] (03CR) 10EBernhardson: "I've already pushed this template update to beta, tested that it works, then pushed it into the prod logstash cluster. It needs to be depl" [puppet] - 10https://gerrit.wikimedia.org/r/353150 (https://phabricator.wikimedia.org/T164823) (owner: 10EBernhardson) [18:53:51] !log dereckson@tin Synchronized php-1.29.0-wmf.21/extensions/TwoColConflict/: Add "oojs-ui" dep to ext.TwoColConflict.filterOptionsJs (duration: 00m 42s) [18:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:45] ACKNOWLEDGEMENT - Router interfaces on mr1-codfw is CRITICAL: CRITICAL: host 208.80.153.196, interfaces up: 35, down: 1, dormant: 0, excluded: 0, unused: 0BRge-0/0/2: down - Cust: Airport Express WiFi APBR Ayounsi https://phabricator.wikimedia.org/T86541 [18:57:02] !log mr1-ulsfo: request system snapshot media internal slice alternate; request system reboot [18:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:11] XioNoX: ^ [18:57:15] RECOVERY - HP RAID on ms-be1035 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [18:58:33] thx [19:00:04] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170510T1900). 
Please do the needful. [19:00:35] PROBLEM - Host mr1-ulsfo.oob is DOWN: PING CRITICAL - Packet loss = 100% [19:01:38] twentyafterfour: SWAT is still ongoing [19:01:54] Dereckson: no problem [19:02:17] Dereckson: I'm here, do you need me still? [19:03:07] Urbanecm: yes, did you follow the es.wikisource issue? [19:03:35] Urbanecm: if so, you can prepare a change to reapply new namespaces with new id [19:04:01] Dereckson: No, I don't know about any issue. But I'll have a look, give me a sec [19:04:45] Urbanecm: theere were a conflict between these id and the ProofredPage extensions one, between Autor: and Index: [19:04:56] Dereckson: Oh. Did I broke anything? [19:05:11] *break [19:05:45] RECOVERY - Host mr1-ulsfo.oob is UP: PING OK - Packet loss = 0%, RTA = 56.76 ms [19:05:57] Urbanecm: Yes, the Index: pages wasn't reacheable, as there now were Autor: instead [19:06:01] weren't [19:06:15] (03PS4) 10Krinkle: Update interwiki map (disable __list sorting) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350899 (https://phabricator.wikimedia.org/T145337) [19:06:22] Is there a list of free numbers of NSs anywhere? [19:06:51] no, but you can check before https://es.wikisource.org/api.php?action=query&meta=siteinfo&siprop=namespaces%7Cnamespacealiases [19:09:02] Dereckson: Okay. As I can see the change was reverted, so I can prepare new patch and double check it would be okay? Am I right? [19:09:09] no, but you can check before https://es.wikisource.org/api.php?action=query&meta=siteinfo&siprop=namespaces%7Cnamespacealiases /w [19:09:39] https://es.wikisource.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces%7Cnamespacealiases [19:09:50] yes, that's right for the new patch [19:10:11] Urbanecm: https://www.mediawiki.org/wiki/Extension_default_namespaces [19:10:16] That is the public registry [19:10:21] Krinkle: Thank you! [19:10:29] generally 100 110 are available [19:10:36] Dereckson: Okay, I'll prepare it. Thank you for your notification. 
[19:10:41] so you fell into a vicious trap [19:10:56] Only two numbers are free? [19:11:10] the full range [19:11:18] 100 to 110 is the range usually used for custom ns [19:11:28] Dereckson: Okay. Thank you. [19:11:33] and the historical ProofreadPage deployment used that too, as the namespaces existed before the extension [19:12:16] Dereckson: Perhaps we can document the WMF custom per-wiki namespaces on this page as well, or at least the general range + the most commonly used ones that exist on multiple wikis [19:12:22] Wikia has a reserved range as well [19:12:32] * Dereckson nods [19:13:09] I can assist with documentation. [19:13:26] Too bad 10x overlaps with SMW [19:13:28] Oh well [19:13:41] SMW isn't used anymore? [19:14:13] Not by us. [19:14:19] But it does exist in the world. [19:14:20] :) [19:14:22] gilles: can you fix the Deployments table? You want https://gerrit.wikimedia.org/r/353112 as URL for TimedMediaHandler [19:14:32] Is page NS defined by wmf-config or an extension for wikisource? [19:14:52] Dereckson: maybe I made a mistake, the links are wrong? [19:15:08] gilles: you linked to the master change, 353112 is the cherry pick one [19:15:14] ah sorry [19:15:16] I'll fix [19:15:26] yes the one you've just pasted is the right one [19:16:06] fixed, sorry about that, too much multitasking [19:16:28] Krinkle: I think it's IS/CS [19:17:08] !log dereckson@tin Synchronized php-1.30.0-wmf.1/extensions/TwoColConflict/: Add "oojs-ui" dep to ext.TwoColConflict.filterOptionsJs (duration: 00m 42s) [19:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:21] thanks Dereckson! [19:18:29] You're welcome addshore.
[19:19:31] (03CR) 10Dereckson: [C: 032] Update interwiki map (disable __list sorting) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350899 (https://phabricator.wikimedia.org/T145337) (owner: 10Krinkle) [19:20:41] (03Merged) 10jenkins-bot: Update interwiki map (disable __list sorting) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350899 (https://phabricator.wikimedia.org/T145337) (owner: 10Krinkle) [19:20:50] (03CR) 10jenkins-bot: Update interwiki map (disable __list sorting) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350899 (https://phabricator.wikimedia.org/T145337) (owner: 10Krinkle) [19:21:04] gilles: your changes are live on mwdebug1002 [19:21:21] thanks, I'll go test [19:21:34] Krinkle: interwiki map live on mwdebug1002 too [19:21:41] k, checking [19:24:02] Dereckson: works fine [19:25:35] Urbanecm: so if I read https://es.wikisource.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces%7Cnamespacealiases 100/101 was ok for Portal, pick any other one for Autor, like 106 or 108 or 110 [19:25:39] gilles: ack'ed [19:26:07] Dereckson: verified, yay [19:26:12] Dereckson: Thank you. 
[19:26:31] !log dereckson@tin Synchronized php-1.30.0-wmf.1/extensions/PagedTiffHandler/PagedTiffHandler_body.php: Store original media dimensions as additional header (T150741) (duration: 00m 42s) [19:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:39] T150741: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741 [19:27:12] Krinkle: ack'ed, syncing [19:27:50] !log dereckson@tin Synchronized wmf-config/interwiki.php: Interwiki map update (disable __list sorting, T145337) (duration: 00m 41s) [19:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:59] T145337: Parsoid should use the "mw" interwiki prefix instead of the "mediawikiwiki" one - https://phabricator.wikimedia.org/T145337 [19:28:34] 06Operations, 10Icinga, 10Monitoring: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3253098 (10Paladox) Hi, other users have this problem see https://github.com/Icinga/icinga2/issues/4614 (note that the user is using icinga 2 as the backend but usi... [19:28:36] !log dereckson@tin Synchronized php-1.30.0-wmf.1/extensions/PdfHandler/PdfHandler_body.php: Store original media dimensions as additional header (T150741) (duration: 00m 42s) [19:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:43] Dereckson: Thank you. 
[19:29:25] !log dereckson@tin Synchronized php-1.30.0-wmf.1/extensions/TimedMediaHandler/: Store original media dimensions as additional header (T150741) (duration: 00m 43s) [19:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:45] PROBLEM - Apache HTTP on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:29:55] PROBLEM - HHVM rendering on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:30:33] twentyafterfour: I'm done [19:30:35] PROBLEM - Nginx local proxy to apache on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:30:44] Dereckson: thanks [19:30:47] sorry for the extra delay [19:31:30] (03PS1) 10Urbanecm: Create Autor and Portal namespaces on Spanish Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353157 (https://phabricator.wikimedia.org/T164195) [19:31:41] Dereckson: ^^ Uploaded. [19:32:08] (03PS1) 1020after4: group1 wikis to 1.30.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353158 [19:32:10] (03CR) 1020after4: [C: 032] group1 wikis to 1.30.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353158 (owner: 1020after4) [19:32:15] PROBLEM - HP RAID on ms-be1035 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [19:32:26] (03CR) 10Dereckson: [C: 031] "Those aren't assigned to ProofreadPage, so we're safe." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353157 (https://phabricator.wikimedia.org/T164195) (owner: 10Urbanecm) [19:33:35] !log deploying 1.30.0-wmf.1 to group1 wikis. 
refs T162954 [19:33:40] (03Merged) 10jenkins-bot: group1 wikis to 1.30.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353158 (owner: 1020after4) [19:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:42] T162954: MW-1.30.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T162954 [19:35:55] (03CR) 10jenkins-bot: group1 wikis to 1.30.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353158 (owner: 1020after4) [19:35:55] PROBLEM - HP RAID on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [19:43:53] (03PS1) 10Dzahn: add private/files/releases/id_rsa.upload FAKE secret key [labs/private] - 10https://gerrit.wikimedia.org/r/353160 [19:44:56] (03PS2) 10Dzahn: add private/files/releases/id_rsa.upload FAKE secret key [labs/private] - 10https://gerrit.wikimedia.org/r/353160 [19:45:14] (03PS3) 10Dzahn: add private/files/releases/id_rsa.upload FAKE secret key [labs/private] - 10https://gerrit.wikimedia.org/r/353160 [19:45:16] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.30.0-wmf.1 [19:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:54] (03CR) 10Paladox: [C: 031] add private/files/releases/id_rsa.upload FAKE secret key [labs/private] - 10https://gerrit.wikimedia.org/r/353160 (owner: 10Dzahn) [19:47:24] (03PS2) 10Smalyshev: Enable archive search on select wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353108 [19:47:43] (03CR) 10Dzahn: [C: 032] "To fix puppet on deployment-tin/deployment-mira." [labs/private] - 10https://gerrit.wikimedia.org/r/353160 (owner: 10Dzahn) [19:48:19] uh oh... 
[19:48:21] (03CR) 10Dzahn: [V: 032 C: 032] add private/files/releases/id_rsa.upload FAKE secret key [labs/private] - 10https://gerrit.wikimedia.org/r/353160 (owner: 10Dzahn) [19:48:21] Notice: Undefined property: stdClass::$ores_damaging_threshold in /srv/mediawiki/php-1.30.0-wmf.1/extensions/ORES/includes/Hooks.php on line 547 [19:48:25] RECOVERY - Host asw-ulsfo is UP: PING WARNING - Packet loss = 73%, RTA = 75.93 ms [19:48:55] RECOVERY - Host mr1-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 74.60 ms [19:49:19] (03PS1) 1020after4: group1 wikis to 1.29.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353161 [19:49:22] (03CR) 1020after4: [C: 032] group1 wikis to 1.29.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353161 (owner: 1020after4) [19:50:19] (03Merged) 10jenkins-bot: group1 wikis to 1.29.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353161 (owner: 1020after4) [19:50:28] (03CR) 10jenkins-bot: group1 wikis to 1.29.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353161 (owner: 1020after4) [19:51:08] !log rolling group1 back to 1.29.0-wmf.21 due to T164984 [19:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:17] T164984: Notice: Undefined property: stdClass::$ores_damaging_threshold in /srv/mediawiki/php-1.30.0-wmf.1/extensions/ORES/includes/Hooks.php on line 547 - https://phabricator.wikimedia.org/T164984 [19:51:35] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.29.0-wmf.21 [19:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:15] RECOVERY - HP RAID on ms-be1035 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [19:56:16] (03PS3) 10Smalyshev: Enable archive search on select wikis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/353108 (https://phabricator.wikimedia.org/T162302) [19:57:09] Dereckson: I'm still seeing the git hash from before the git submodule updates on testwiki/mediawiki.org; is that expected? [19:57:57] gilles: your fixes were for wmf29.21 or wmf30.1? [19:58:07] wmf30.1 [19:58:48] and you've got the headers fine when you request a document? [19:59:04] if so, could be a local cache on your browser, or on varnish [19:59:13] it doesn't work right now, I mean it's as if my change isn't applied, and it seemed to work on the debug machine [19:59:53] okay, let's check the tin state and resync if needed [20:00:05] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170510T2000). [20:00:15] PROBLEM - Host mr1-ulsfo is DOWN: CRITICAL - Network Unreachable (198.35.26.194) [20:00:15] no parsoid deploy today [20:00:15] PROBLEM - Host asw-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [20:00:30] Nothing for ORES today [20:01:09] gilles: so if we look at https://tools.wmflabs.org/versions/ only test.wikipedia.org should have it [20:01:14] Dereckson: could be that my change doesn't work, the problem is that the TLS terminator doesn't seem to like the debug header, so I couldn't upload large files on the test machine [20:01:21] gilles: twentyafterfour rolled back the train [20:01:29] gotcha [20:01:46] I thought mediawiki.org was in group0 with testwiki [20:01:58] It is [20:02:54] so the part I don't understand is why Special:Version on testwiki, now that it's supposed to be deployed there, doesn't have its git hash next to the version number point to the latest git submodule update commit [20:03:06] instead, it's pointing to the commit before those [20:03:25] hmm [20:04:05] PROBLEM - Host mr1-ulsfo.oob is DOWN: PING CRITICAL - Packet loss = 100% [20:04:15] RECOVERY
- Host asw-ulsfo is UP: PING WARNING - Packet loss = 61%, RTA = 76.41 ms [20:04:55] RECOVERY - Host mr1-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 74.26 ms [20:05:52] so, first, your change is correctly applied on Tin [20:05:55] PROBLEM - HP RAID on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [20:06:05] I think maybe the git info only gets updated with a full scap? [20:06:06] now let's compare file hashes [20:06:10] could be twentyafterfour [20:06:23] pretty sure what twentyafterfour said is true [20:06:35] ah, that's possible, I don't know how that works. and hashat did do a full scap this afternoon [20:06:37] because scap is still rsync, the git hash doesn't get updated when we swat [20:06:39] hashar [20:06:53] yeah. only with full scap. I think there is a patch somewhere to add it to all syncs but it was sketchy [20:06:59] makes sense [20:07:33] On Tin: [20:07:37] 399f91d85b677dc4f84b0f6d9490b63e handlers/OggHandler/OggHandler.php [20:07:40] b4ce8ba779d08d55dc92fe4fbc403279 handlers/WebMHandler/WebMHandler.php [20:07:45] RECOVERY - Host mr1-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 75.94 ms [20:08:05] Let's now check on a random server [20:09:15] RECOVERY - Host mr1-ulsfo.oob is UP: PING OK - Packet loss = 0%, RTA = 56.34 ms [20:10:24] 06Operations, 06DC-Ops, 10netops: mr1-ulsfo crashed - https://phabricator.wikimedia.org/T164970#3253261 (10RobH) a:05RobH>03ayounsi So united layer support rebooted this for us, and now @ayounsi is working on recovery. [20:11:17] gilles: for TimedMediaHandler, hashes on mw1263 match the yours [20:11:25] and the ones on tin [20:11:26] thanks [20:16:56] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/6378/" [puppet] - 10https://gerrit.wikimedia.org/r/353095 (owner: 10Faidon Liambotis) [20:23:15] PROBLEM - HP RAID on ms-be1037 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. 
[20:28:45] okay now SWAT is done, we can redeploy Autor: on WS, as they want Portal: back [20:29:15] (03CR) 10Dereckson: [C: 032] "To restore Portal: namespace mainly, and allow more useful renames for Autor:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353157 (https://phabricator.wikimedia.org/T164195) (owner: 10Urbanecm) [20:29:33] now the train is done [20:30:40] (03Merged) 10jenkins-bot: Create Autor and Portal namespaces on Spanish Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353157 (https://phabricator.wikimedia.org/T164195) (owner: 10Urbanecm) [20:30:52] (03CR) 10jenkins-bot: Create Autor and Portal namespaces on Spanish Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353157 (https://phabricator.wikimedia.org/T164195) (owner: 10Urbanecm) [20:31:30] !log bsitzmann@tin Started deploy [mobileapps/deploy@5d3b34a]: Update mobileapps to 75b135e [20:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:15] RECOVERY - HP RAID on ms-be1037 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [20:33:36] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Restore Autor: and Portal: namespaces on es.wikisource (T164195) (duration: 00m 42s) [20:33:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:45] T164195: Create Autor and Portal namespaces on Spanish Wikisource - https://phabricator.wikimedia.org/T164195 [20:34:29] (03PS1) 10Dzahn: deployment::server: move add_ip6_mapped back to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/353171 [20:35:25] !log bsitzmann@tin Finished deploy [mobileapps/deploy@5d3b34a]: Update mobileapps to 75b135e (duration: 03m 55s) [20:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:09] !log Run namespaceDupes.php on es.wikisource (T164195) [20:36:16] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:55] !log demon@tin Synchronized README: no-op, comaster sync (duration: 00m 42s) [20:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:45] 06Operations, 10ops-codfw, 10netops: codfw: kubernetes200[1-4] switch port configuration - https://phabricator.wikimedia.org/T164988#3253327 (10Papaul) [20:41:13] (03PS1) 10Aaron Schulz: Move swift auth URL to ProductionServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353173 [20:41:31] AaronSchulz: o/ [20:42:27] if you have time today/tomorrow, can you review https://gerrit.wikimedia.org/r/#/c/351854 ? [20:43:31] (03CR) 10Brian Wolff: "Just as a reminder, updateCategoryCollation.php must be run on bawiki after this is deployed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353099 (https://phabricator.wikimedia.org/T162823) (owner: 10Amire80) [20:44:03] 06Operations, 06Performance-Team, 10Thumbor, 05MW-1.30-release-notes (WMF-deploy-2017-05-09_(1.30.0-wmf.1)), 13Patch-For-Review: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#3253362 (10Gilles) Deployed on testwiki, w... 
[20:45:55] (03PS5) 10Ayounsi: Various LibreNMS improvements [puppet] - 10https://gerrit.wikimedia.org/r/353088 (https://phabricator.wikimedia.org/T164911) [20:47:18] (03CR) 10Aaron Schulz: [C: 031] Re-enable persistent connection to Redis for jobrunners [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351854 (https://phabricator.wikimedia.org/T125735) (owner: 10Elukey) [20:48:17] elukey: lgtm [20:50:33] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: kubernetes200[1-4] racking and onsite setup task - https://phabricator.wikimedia.org/T164851#3253369 (10RobH) [20:50:35] 06Operations, 10ops-codfw, 10netops: codfw: kubernetes200[1-4] switch port configuration - https://phabricator.wikimedia.org/T164988#3253366 (10RobH) 05Open>03Resolved a:03RobH done! [20:50:45] AaronSchulz: thanks! [20:53:58] 06Operations, 06DC-Ops, 10netops: mr1-ulsfo crashed - https://phabricator.wikimedia.org/T164970#3253392 (10ayounsi) Its internal storage is corrupted, @faidon re-did the steps listed on https://phabricator.wikimedia.org/T127295 And I restored the last working configuration based on rancid and jnt. Ran "reque... 
[20:54:58] (03PS2) 10Dzahn: deployment::server: move add_ip6_mapped back to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/353171 [20:55:01] !log restart hhvm on mw1268 (HHVM 3.12, HPHP::Treadmill::getAgeOldestRequest issue) [20:55:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:35] RECOVERY - Nginx local proxy to apache on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.182 second response time [20:56:45] RECOVERY - Apache HTTP on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.096 second response time [20:56:46] RECOVERY - HHVM rendering on mw1278 is OK: HTTP OK: HTTP/1.1 200 OK - 75097 bytes in 0.251 second response time [20:56:49] 06Operations, 10netops, 13Patch-For-Review: LibreNMS improvements - https://phabricator.wikimedia.org/T164911#3253405 (10ayounsi) [20:57:41] 06Operations, 06Commons, 10media-storage: More missing 'original' files on Commons - https://phabricator.wikimedia.org/T163068#3253406 (10MoritzMuehlenhoff) p:05Triage>03Normal [20:58:09] 06Operations, 10DBA: dbtree: don't return 200 on error pages - https://phabricator.wikimedia.org/T163143#3253407 (10MoritzMuehlenhoff) p:05Triage>03Normal [20:58:39] 06Operations, 10ops-codfw: Degraded RAID on elastic2020 - https://phabricator.wikimedia.org/T164953#3253408 (10MoritzMuehlenhoff) a:03Papaul [21:01:13] (03CR) 10Dzahn: [C: 032] "this way it's a real no-op: http://puppet-compiler.wmflabs.org/6380/" [puppet] - 10https://gerrit.wikimedia.org/r/353171 (owner: 10Dzahn) [21:09:16] 06Operations, 06Community-Tech, 10MediaWiki-CrossWikiWatchlist, 10hardware-requests, 07Crosswiki: Acquire new hardware for hosting cross-wiki watchlist database - https://phabricator.wikimedia.org/T142538#3253439 (10RobH) 05Open>03declined Since this is requesting a prototype by the DBAs, before allo... 
[21:10:59] 06Operations, 10hardware-requests: decom arsenic: (was: reclaim arsenic as spare) - https://phabricator.wikimedia.org/T83340#3253455 (10RobH) [21:12:53] 06Operations, 10Page-Previews, 06Performance-Team, 06Reading-Web-Backlog, and 3 others: Performance review #2 of Hovercards (Popups extension) - https://phabricator.wikimedia.org/T70861#3253457 (10Gilles) 05Open>03Resolved a:03Gilles FYI we usually link to the RAIL guidelines because they're easy to... [21:15:19] (03CR) 10Dzahn: "@AndrewBogott Ok, thank you. the biggest part for me is usually figuring out which nodes i really have to compile it on. Here, i did: http" [puppet] - 10https://gerrit.wikimedia.org/r/352636 (owner: 10Dzahn) [21:15:24] (03PS1) 10Thcipriani: Scap: add beta canary_dashboard_url config value [puppet] - 10https://gerrit.wikimedia.org/r/353179 (https://phabricator.wikimedia.org/T164981) [21:15:30] (03PS2) 10Dzahn: openstack: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352636 [21:21:43] (03CR) 10Dzahn: "this fixed the puppet run on deployment-tin and deployment-mira now. (in addition to https://gerrit.wikimedia.org/r/#/c/353160/ and adjust" [puppet] - 10https://gerrit.wikimedia.org/r/353171 (owner: 10Dzahn) [21:23:48] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: decommision nembus - https://phabricator.wikimedia.org/T162928#3253472 (10RobH) 05Open>03Resolved a:05RobH>03None [21:24:37] 06Operations: archiva artifact links point to 127.0.0.1 - https://phabricator.wikimedia.org/T164993#3253475 (10Smalyshev) [21:28:25] 06Operations, 10DBA, 06Performance-Team, 10Traffic: Cache invalidations coming from the JobQueue are causing slowdown on masters and lag on several wikis, and impact on varnish - https://phabricator.wikimedia.org/T164173#3253496 (10aaron) The job run rate and type run rate graphs seem uninteresting in that... 
[21:28:39] 06Operations, 10ops-codfw: Degraded RAID on elastic2020 - https://phabricator.wikimedia.org/T164953#3253497 (10Papaul) @MoritzMuehlenhoff Thanks for the update we are working on this on T149006 [21:29:18] 06Operations, 10ops-codfw, 10netops: codfw: kubernetes200[1-4] switch port configuration - https://phabricator.wikimedia.org/T164988#3253504 (10Papaul) @RobH Thanks. [21:30:45] (03CR) 10Dzahn: [C: 032] openstack: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352636 (owner: 10Dzahn) [21:32:18] (03CR) 10Paladox: "Hi, this broke some deployment instances on labs." [puppet] - 10https://gerrit.wikimedia.org/r/350765 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [21:34:05] 06Operations, 10DBA, 06Performance-Team, 10Traffic: Cache invalidations coming from the JobQueue are causing slowdown on masters and lag on several wikis, and impact on varnish - https://phabricator.wikimedia.org/T164173#3253516 (10jcrespo) I think this was a one-time user doing multiple purges, we can clo... 
[21:38:09] !log maxsem@tin Started deploy [kartotherian/deploy@9401f38]: Try https://gerrit.wikimedia.org/r/#/c/352886/ and https://gerrit.wikimedia.org/r/#/c/353184/ on test hosts [21:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:20] !log twentyafterfour@tin Synchronized php-1.30.0-wmf.1/extensions/ORES/includes/Hooks.php: sync fix for T164984 refs T162954 (duration: 00m 42s) [21:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:29] T162954: MW-1.30.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T162954 [21:38:29] T164984: Notice: Undefined property: stdClass::$ores_damaging_threshold in /srv/mediawiki/php-1.30.0-wmf.1/extensions/ORES/includes/Hooks.php on line 547 - https://phabricator.wikimedia.org/T164984 [21:41:13] (03Draft1) 10Paladox: lvs/configuation: Fix inline template so it will use :: if $ipaddress does not exist or puppet can't find it [puppet] - 10https://gerrit.wikimedia.org/r/353186 [21:41:16] (03PS2) 10Paladox: lvs/configuation: Fix inline template so it will use :: if $ipaddress does not exist or puppet can't find it [puppet] - 10https://gerrit.wikimedia.org/r/353186 [21:43:25] (03CR) 10jerkins-bot: [V: 04-1] lvs/configuation: Fix inline template so it will use :: if $ipaddress does not exist or puppet can't find it [puppet] - 10https://gerrit.wikimedia.org/r/353186 (owner: 10Paladox) [21:46:50] (03PS3) 10Paladox: lvs/configuation: Fix inline template so it will use :: if $ipaddress does not exist or puppet can't find it [puppet] - 10https://gerrit.wikimedia.org/r/353186 [21:51:16] (03PS4) 10Paladox: lvs/configuation: Fix inline template so it will use :: if $ipaddress does not exist or puppet can't find it [puppet] - 10https://gerrit.wikimedia.org/r/353186 [21:54:43] (03PS5) 10Paladox: lvs/configuation: Fix inline template so it will use :: if $ipaddress does not exist or puppet can't find it [puppet] - 10https://gerrit.wikimedia.org/r/353186 
[21:55:04] (03CR) 10Volans: "yes, we were already aware of it. Thanks for letting us know." [puppet] - 10https://gerrit.wikimedia.org/r/350765 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [21:57:01] (03Abandoned) 10Paladox: lvs/configuation: Fix inline template so it will use :: if $ipaddress does not exist or puppet can't find it [puppet] - 10https://gerrit.wikimedia.org/r/353186 (owner: 10Paladox) [21:57:10] (03CR) 10Paladox: "Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/350765 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [21:59:56] (03CR) 10Dzahn: "hey, this was uploaded in 2014 and bumped in 2015, any updates?" [puppet] - 10https://gerrit.wikimedia.org/r/145018 (owner: 10ArielGlenn) [22:05:11] (03CR) 10Dzahn: "does this run on a prod none (can it be compiled?)" [puppet] - 10https://gerrit.wikimedia.org/r/352660 (owner: 10Dzahn) [22:05:36] (03CR) 10Paladox: "Bump, any updates on this?" [puppet] - 10https://gerrit.wikimedia.org/r/240945 (owner: 10Alex Monk) [22:07:27] (03CR) 10Alex Monk: "If there were an update, someone would have already written it." 
[puppet] - 10https://gerrit.wikimedia.org/r/240945 (owner: 10Alex Monk) [22:08:17] (03CR) 10Paladox: "Bump" [puppet] - 10https://gerrit.wikimedia.org/r/225552 (owner: 10Gergő Tisza) [22:08:56] (03PS3) 10Paladox: deployment_server: Fix misspelt variable [puppet] - 10https://gerrit.wikimedia.org/r/353094 [22:16:37] (03CR) 10Dzahn: [C: 032] deployment_server: Fix misspelt variable [puppet] - 10https://gerrit.wikimedia.org/r/353094 (owner: 10Paladox) [22:16:49] thanks ^^ [22:17:11] thanks for fixing my typo [22:19:16] Your welcome :) [22:24:40] (03PS3) 10Paladox: Gerrit: Enable g1 gc as we now use java 8 [puppet] - 10https://gerrit.wikimedia.org/r/327763 (https://phabricator.wikimedia.org/T148478) [22:25:14] (03CR) 10Paladox: Gerrit: Enable g1 gc as we now use java 8 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/327763 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [22:25:41] (03CR) 10Paladox: [C: 031] puppetmaster: /var/lib/puppet/ssl should be group puppet [puppet] - 10https://gerrit.wikimedia.org/r/248302 (owner: 10Alexandros Kosiaris) [22:27:12] (03PS2) 10Dzahn: webperf: Remove remnants of webperf::asset_check [puppet] - 10https://gerrit.wikimedia.org/r/353104 (https://phabricator.wikimedia.org/T164419) (owner: 10Krinkle) [22:27:35] (03CR) 10Paladox: [C: 031] monitoring/base: add NRPE command to check temperature (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/310383 (https://phabricator.wikimedia.org/T125205) (owner: 10Dzahn) [22:29:57] paladox: first see the actual comments [22:30:06] "for some reason it's always CRIT for other non-temperature reasons, like power supply or case instrusion (when testing on lead), even when specifying the -T temperature only. 
i was trying to exclude other checks but it's not good like this yet" [22:30:09] ok [22:30:56] i simply don't know why that plugin doesn't work, separate from precise [22:34:17] (03CR) 10Dzahn: [C: 04-1] monitoring/base: add NRPE command to check temperature (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/310383 (https://phabricator.wikimedia.org/T125205) (owner: 10Dzahn) [22:34:55] (03CR) 10Dzahn: [C: 032] webperf: Remove remnants of webperf::asset_check [puppet] - 10https://gerrit.wikimedia.org/r/353104 (https://phabricator.wikimedia.org/T164419) (owner: 10Krinkle) [22:35:05] ok [22:35:25] if you want to amend, go ahead, paladox [22:35:38] re: precise-check. see packages.ubuntu.com [22:35:39] ok [22:36:03] it seems like it was only precise, dunno why i did not check for that back then [22:36:12] check for trusty i mean [22:40:52] ok [22:51:45] (03CR) 10Dzahn: [C: 031] "hmm, interesting. it's still like this but when running puppet on a master i don't see the change on every run as described here."
[puppet] - 10https://gerrit.wikimedia.org/r/248302 (owner: 10Alexandros Kosiaris) [22:56:05] PROBLEM - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [3000.0] [22:57:05] (03Draft1) 10Paladox: contint: Install php5-gimp and php7.0-gmp [puppet] - 10https://gerrit.wikimedia.org/r/353194 [22:57:07] (03PS2) 10Paladox: contint: Install php5-gmp and php7.0-gmp [puppet] - 10https://gerrit.wikimedia.org/r/353194 (https://phabricator.wikimedia.org/T164977) [22:57:11] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 3 others: Purge Varnish cache when a banner is saved - https://phabricator.wikimedia.org/T154954#3253725 (10DStrine) [22:57:16] PROBLEM - mediawiki originals uploads -hourly- for eqiad-prod on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [3000.0] [23:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170510T2300). Please do the needful. [23:00:05] RoanKattouw, James_F, and SMalyshev: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:11] o/ [23:00:15] RECOVERY - mediawiki originals uploads -hourly- for eqiad-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0] [23:00:30] here [23:00:39] * James_F waves. [23:00:57] RoanKattouw: You SWATing? [23:01:07] note: train still hasn't ran for group1 [23:01:08] I guess I'll have to [23:01:23] twentyafterfour: Do we. [23:01:26] Bah [23:01:36] Do we need to wait? [23:01:38] it's fine if you want to do swat [23:01:46] I can deploy afterwards [23:01:56] James's patches contain i18n changes [23:01:58] still waiting on jenkins anyway... 
[23:02:01] So I think I'll let those ride with your scap, twentyafterfour [23:02:06] ok [23:02:13] The others are small, I can do them quickly [23:02:23] That was going to be my suggestion, yeah. [23:02:35] (03PS4) 10Catrope: Enable archive search on select wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353108 (https://phabricator.wikimedia.org/T162302) (owner: 10Smalyshev) [23:02:41] (03CR) 10Catrope: [C: 032] Enable archive search on select wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353108 (https://phabricator.wikimedia.org/T162302) (owner: 10Smalyshev) [23:03:05] RECOVERY - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0] [23:05:01] (03Merged) 10jenkins-bot: Enable archive search on select wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353108 (https://phabricator.wikimedia.org/T162302) (owner: 10Smalyshev) [23:05:48] (03CR) 10jenkins-bot: Enable archive search on select wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353108 (https://phabricator.wikimedia.org/T162302) (owner: 10Smalyshev) [23:06:35] RoanKattouw: can I have it on terbium for tests [23:06:57] SMalyshev: Done, test away [23:07:27] Whoa, there's a gate-and-submit-swat queue? Sweet! 
[23:07:30] RoanKattouw: excellent, works just fine [23:08:39] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable archive search on select wikis (T162302) (duration: 00m 41s) [23:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:47] T162302: Add archive index to wikis - https://phabricator.wikimedia.org/T162302 [23:08:50] thanks [23:13:48] !log catrope@tin Synchronized php-1.30.0-wmf.1/extensions/WikimediaEvents/: T164617 (duration: 00m 42s) [23:13:54] (03PS3) 10Dzahn: dynamicproxy: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352660 [23:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:56] T164617: Get stats on how frequently RC Page related links (at page top) are clicked - https://phabricator.wikimedia.org/T164617 [23:15:10] Also going to throw in https://gerrit.wikimedia.org/r/#/c/353196/1 if we're going to have i18n changes anyway [23:15:33] One that's merged and pulled, the way will be clear for twentyafterfour to scap [23:17:46] PROBLEM - SSH on ms-be1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:17:55] OK, all done [23:17:59] twentyafterfour: It's all yours [23:18:14] There are some unsynced changes but they're all in i18n/*.json files, so I'm relying on your scap to pick those up [23:18:22] RoanKattouw: thanks! [23:18:35] RECOVERY - SSH on ms-be1021 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [23:19:56] !log twentyafterfour@tin Started scap: Sync fix for T164983 plus i18n files leftover from swat. 
refs T162954 [23:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:04] T164983: Notice: Undefined index: quality in /srv/mediawiki/php-1.30.0-wmf.1/includes/media/Jpeg.php on line 208 - https://phabricator.wikimedia.org/T164983 [23:20:05] T162954: MW-1.30.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T162954 [23:20:06] 06Operations, 10hardware-requests: decom arsenic: (was: reclaim arsenic as spare) - https://phabricator.wikimedia.org/T83340#3253774 (10RobH) [23:20:20] (03CR) 10Dzahn: [C: 032] dynamicproxy: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352660 (owner: 10Dzahn) [23:22:16] 06Operations, 10hardware-requests: decom arsenic: (was: reclaim arsenic as spare) - https://phabricator.wikimedia.org/T83340#3253775 (10RobH) I don't see it on any of the switch stack descriptions in eqiad, so likely it was done previously. [23:22:58] (03CR) 10Dzahn: "i ran puppet on tools-proxy-01 and novaproxy-01 and confirmed nothing happened" [puppet] - 10https://gerrit.wikimedia.org/r/352660 (owner: 10Dzahn) [23:24:46] (03PS1) 10RobH: decom arsenic [dns] - 10https://gerrit.wikimedia.org/r/353197 [23:25:08] 06Operations, 10hardware-requests, 13Patch-For-Review: decom arsenic: (was: reclaim arsenic as spare) - https://phabricator.wikimedia.org/T83340#3253779 (10RobH) [23:25:17] (03CR) 10RobH: [C: 032] decom arsenic [dns] - 10https://gerrit.wikimedia.org/r/353197 (owner: 10RobH) [23:27:14] 06Operations, 10hardware-requests, 13Patch-For-Review: decom arsenic: (was: reclaim arsenic as spare) - https://phabricator.wikimedia.org/T83340#3253781 (10RobH) 05Open>03Resolved a:05RobH>03None [23:27:27] (03CR) 10Dzahn: [C: 032] site.pp: consistent quoting for role names [puppet] - 10https://gerrit.wikimedia.org/r/353117 (owner: 10Dzahn) [23:30:46] 06Operations, 10ops-codfw, 10hardware-requests: Decomission ms-fe2001-4 - https://phabricator.wikimedia.org/T159413#3253787 (10RobH) 05Open>03Resolved 
a:05RobH>03None [23:32:55] (03CR) 10Dzahn: "confirming this doesnt change anyting on analytics1003" [puppet] - 10https://gerrit.wikimedia.org/r/353000 (owner: 10Dzahn) [23:33:24] (03PS2) 10Dzahn: hadoop: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/353000 [23:34:10] (03PS3) 10Dzahn: hadoop: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/353000 [23:35:45] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:36:25] (03CR) 10Dzahn: [C: 032] hadoop: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/353000 (owner: 10Dzahn) [23:36:35] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient [23:40:44] (03PS2) 10Dzahn: site.pp: consistent quoting for role names [puppet] - 10https://gerrit.wikimedia.org/r/353117 [23:41:23] (03CR) 10Dzahn: [V: 032 C: 032] site.pp: consistent quoting for role names [puppet] - 10https://gerrit.wikimedia.org/r/353117 (owner: 10Dzahn) [23:45:21] (03PS3) 10Dzahn: kafkatee: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352999 [23:45:29] (03PS4) 10Dzahn: kafkatee: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352999 [23:49:02] (03CR) 10Dzahn: ""You have searched for packages that names contain php7.0-gmp in suite(s) stable, all sections, and all architectures. "" [puppet] - 10https://gerrit.wikimedia.org/r/353194 (https://phabricator.wikimedia.org/T164977) (owner: 10Paladox) [23:50:33] !log twentyafterfour@tin Finished scap: Sync fix for T164983 plus i18n files leftover from swat. 
refs T162954 (duration: 30m 37s) [23:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:42] T164983: Notice: Undefined index: quality in /srv/mediawiki/php-1.30.0-wmf.1/includes/media/Jpeg.php on line 208 - https://phabricator.wikimedia.org/T164983 [23:50:42] T162954: MW-1.30.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T162954 [23:51:03] (03CR) 10Dzahn: [C: 04-1] "E: Unable to locate package php7.0-gmp" [puppet] - 10https://gerrit.wikimedia.org/r/353194 (https://phabricator.wikimedia.org/T164977) (owner: 10Paladox) [23:51:51] (03CR) 10Dzahn: [C: 032] kafkatee: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352999 (owner: 10Dzahn) [23:55:55] PROBLEM - HP RAID on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds.