[00:15:22] PROBLEM - Outgoing network saturation on labstore1001 is CRITICAL: CRITICAL: 10.34% of data above the critical threshold [106250000.0]
[00:19:12] RECOVERY - Outgoing network saturation on labstore1001 is OK: OK: Less than 10.00% above the threshold [93750000.0]
[01:09:53] PROBLEM - puppet last run on ms-be2018 is CRITICAL: CRITICAL: puppet fail
[01:37:19] RECOVERY - puppet last run on ms-be2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:48:17] PROBLEM - Outgoing network saturation on labstore1001 is CRITICAL: CRITICAL: 17.24% of data above the critical threshold [106250000.0]
[02:14:55] PROBLEM - puppet last run on ms-be2007 is CRITICAL: CRITICAL: puppet fail
[02:22:54] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.4) (duration: 09m 34s)
[02:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:28:50] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Jun 6 02:28:50 UTC 2016 (duration 5m 56s)
[02:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:42:06] RECOVERY - puppet last run on ms-be2007 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[02:55:05] PROBLEM - Outgoing network saturation on labstore1001 is CRITICAL: CRITICAL: 16.67% of data above the critical threshold [106250000.0]
[03:43:35] RECOVERY - Outgoing network saturation on labstore1001 is OK: OK: Less than 10.00% above the threshold [93750000.0]
[03:46:11] (PS8) GWicke: Logstash_checker script for canary deploys [puppet] - https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068)
[03:46:13] (CR) GWicke: Logstash_checker script for canary deploys (1 comment) [puppet] - https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: GWicke)
[03:47:41] (CR) jenkins-bot: [V: -1] Logstash_checker script for canary deploys [puppet] - https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: GWicke)
[04:07:32] PROBLEM - puppet last run on mw1148 is CRITICAL: CRITICAL: Puppet has 62 failures
[05:05:42] RECOVERY - puppet last run on mw1148 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[05:08:51] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 11.018 second response time
[05:10:53] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 15.962 second response time
[05:27:51] PROBLEM - Check size of conntrack table on db2034 is CRITICAL: Timeout while attempting connection
[05:28:12] PROBLEM - salt-minion processes on db2034 is CRITICAL: Timeout while attempting connection
[05:28:21] PROBLEM - MariaDB Slave SQL: s1 on db2034 is CRITICAL: Timeout while attempting connection
[05:28:31] PROBLEM - MariaDB Slave IO: s1 on db2034 is CRITICAL: Timeout while attempting connection
[05:28:32] PROBLEM - dhclient process on db2034 is CRITICAL: Timeout while attempting connection
[05:28:33] PROBLEM - DPKG on db2034 is CRITICAL: Timeout while attempting connection
[05:28:36] PROBLEM - MariaDB disk space on db2034 is CRITICAL: Timeout while attempting connection
[05:28:41] PROBLEM - HP RAID on db2034 is CRITICAL: Timeout while attempting connection
[05:29:13] PROBLEM - configured eth on db2034 is CRITICAL: Timeout while attempting connection
[05:29:21] PROBLEM - Disk space on db2034 is CRITICAL: Timeout while attempting connection
[05:29:44] PROBLEM - mysqld processes on db2034 is CRITICAL: Timeout while attempting connection
[05:29:45] PROBLEM - puppet last run on db2034 is CRITICAL: Timeout while attempting connection
[05:31:20] hrmm, db2034 seems locked up via serial console
[05:32:38] PROBLEM - Host db2034 is DOWN: PING CRITICAL - Packet loss = 100%
[05:34:36] !log db2034 locked up via serial console. details on T137084, rebooting since its unresponsive to ssh or serial.
[05:34:38] T137084: db2034 crash - https://phabricator.wikimedia.org/T137084
[05:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:36:59] RECOVERY - configured eth on db2034 is OK: OK - interfaces up
[05:37:09] RECOVERY - Host db2034 is UP: PING OK - Packet loss = 0%, RTA = 36.89 ms
[05:37:09] RECOVERY - Disk space on db2034 is OK: DISK OK
[05:37:12] RECOVERY - MariaDB disk space on db2034 is OK: DISK OK
[05:37:50] RECOVERY - DPKG on db2034 is OK: All packages OK
[05:37:50] RECOVERY - dhclient process on db2034 is OK: PROCS OK: 0 processes with command name dhclient
[05:38:09] RECOVERY - salt-minion processes on db2034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[05:38:09] RECOVERY - puppet last run on db2034 is OK: OK: Puppet is currently enabled, last run 14 minutes ago with 0 failures
[05:38:18] RECOVERY - HP RAID on db2034 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12, Controller, Battery/Capacitor
[05:38:20] RECOVERY - Check size of conntrack table on db2034 is OK: OK: nf_conntrack is 0 % full
[05:38:48] PROBLEM - puppet last run on mw1148 is CRITICAL: CRITICAL: Puppet has 68 failures
[05:56:09] PROBLEM - MariaDB Slave Lag: s1 on db2034 is CRITICAL: CRITICAL slave_sql_lag could not connect
[06:07:34] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 688 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5638898 keys - replication_delay is 688
[06:10:51] Operations, MediaWiki-General-or-Unknown, HHVM, Patch-For-Review: Make all role::mediawiki::* classes compatible with debian jessie - https://phabricator.wikimedia.org/T131749#2356694 (Joe) >>! In T131749#2338188, @Joe wrote: > What is left to do: > > [x] Make mediawiki::cgroup work with systemd...
[06:15:23] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5622564 keys - replication_delay is 16
[06:18:08] !log aaron@tin Synchronized php-1.28.0-wmf.4/includes/cache/LinkBatch.php: c2ba764f38e44e7 (duration: 00m 30s)
[06:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:23:54] RECOVERY - Apache HTTP on mw1262 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 6.505 second response time
[06:26:51] (CR) Nikerabbit: Beta: Enable Compact Language Links for new users (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/291908 (https://phabricator.wikimedia.org/T136161) (owner: KartikMistry)
[06:29:54] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:14] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:24] PROBLEM - puppet last run on pc1006 is CRITICAL: CRITICAL: puppet fail
[06:30:34] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: puppet fail
[06:30:49] Operations, ops-codfw, DBA: db2034 crash - https://phabricator.wikimedia.org/T137084#2356708 (jcrespo) It seems there was a RAID controller failure: > A controller failure event occurred prior to this power-up We had similar issues on T130702. We may need a general upgrade of all machines with simi...
[06:30:55] PROBLEM - puppet last run on elastic1042 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:14] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:23] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: puppet fail
[06:31:53] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:13] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:43] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:55] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:37:34] PROBLEM - Outgoing network saturation on labstore1001 is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [106250000.0]
[06:39:13] Operations, ops-codfw, DBA: db2034 degraded RAID - https://phabricator.wikimedia.org/T136583#2356722 (jcrespo) a:jcrespo>Papaul This host crashed today: T137084 due to a RAID controller failure. Are we still sure this was safe? Papaul, could you please follow up with support?
[06:41:21] Operations, ops-codfw, DBA: db2034 crash - https://phabricator.wikimedia.org/T137084#2356726 (jcrespo) This host being down was creating log noise due to health checks (no users affected): https://logstash.wikimedia.org/#dashboard/temp/AVUkao15_LTxu7wl9U3S
[06:56:25] RECOVERY - puppet last run on elastic1042 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:56:44] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[06:56:54] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[06:57:24] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:24] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:44] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[06:57:44] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:54] RECOVERY - puppet last run on pc1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:05] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:13] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:24] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:01:13] RECOVERY - Outgoing network saturation on labstore1001 is OK: OK: Less than 10.00% above the threshold [93750000.0]
[07:03:48] RECOVERY - puppet last run on mw1148 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[07:11:50] (PS11) Elukey: Extend the %{format}t timestamp formatter with (begin|end): prefixes [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/292172 (https://phabricator.wikimedia.org/T136314)
[07:18:31] (PS4) Giuseppe Lavagetto: mediawiki::cgroup: systemd compatibility [puppet] - https://gerrit.wikimedia.org/r/292113 (https://phabricator.wikimedia.org/T131749)
[07:18:33] (PS2) Giuseppe Lavagetto: hhvm::debug: fix hhvm-dump-debug on jessie [puppet] - https://gerrit.wikimedia.org/r/292160
[07:18:35] (PS1) Giuseppe Lavagetto: facter: remove references to $::memorytotal [puppet] - https://gerrit.wikimedia.org/r/292896 (https://phabricator.wikimedia.org/T131749)
[07:24:43] Operations: Integrate jessie 8.6 point release - https://phabricator.wikimedia.org/T137087#2356766 (MoritzMuehlenhoff)
[07:24:56] Operations: Integrate jessie 8.5 point release - https://phabricator.wikimedia.org/T137087#2356779 (MoritzMuehlenhoff)
[07:26:29] (PS2) Giuseppe Lavagetto: facter: remove references to $::memorytotal [puppet] - https://gerrit.wikimedia.org/r/292896 (https://phabricator.wikimedia.org/T131749)
[07:26:31] (PS5) Giuseppe Lavagetto: mediawiki::cgroup: systemd compatibility [puppet] - https://gerrit.wikimedia.org/r/292113 (https://phabricator.wikimedia.org/T131749)
[07:26:33] (PS3) Giuseppe Lavagetto: hhvm::debug: fix hhvm-dump-debug on jessie [puppet] - https://gerrit.wikimedia.org/r/292160
[07:37:40] PROBLEM - puppet last run on labmon1001 is CRITICAL: CRITICAL: Puppet last ran 2 days ago
[07:37:59] PROBLEM - Outgoing network saturation on labstore1001 is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [106250000.0]
[07:39:39] RECOVERY - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:40:49] PROBLEM - puppet last run on db2052 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:46:26] <_joe_> :)/win 15
[07:51:30] RECOVERY - Disk space on labmon1001 is OK: DISK OK
[07:51:48] PROBLEM - carbon-frontend-relay service on labmon1001 is CRITICAL: CRITICAL - Expecting active but unit carbon-frontend-relay is inactive
[07:53:40] RECOVERY - carbon-frontend-relay service on labmon1001 is OK: OK - carbon-frontend-relay is active
[07:57:09] !log enabling GTID on pending coredb servers on eqiad
[07:57:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:57:27] (CR) Giuseppe Lavagetto: [C: 2] mediawiki::cgroup: systemd compatibility [puppet] - https://gerrit.wikimedia.org/r/292113 (https://phabricator.wikimedia.org/T131749) (owner: Giuseppe Lavagetto)
[08:06:28] RECOVERY - puppet last run on db2052 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[08:06:51] (PS1) Giuseppe Lavagetto: mediawiki::cgroup: don't restart upon changes to the unit [puppet] - https://gerrit.wikimedia.org/r/292897
[08:12:29] PROBLEM - puppet last run on mw1018 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:12:38] <_joe_> this is me ^^
[08:12:44] (CR) Giuseppe Lavagetto: [C: 2] mediawiki::cgroup: don't restart upon changes to the unit [puppet] - https://gerrit.wikimedia.org/r/292897 (owner: Giuseppe Lavagetto)
[08:17:18] RECOVERY - Outgoing network saturation on labstore1001 is OK: OK: Less than 10.00% above the threshold [93750000.0]
[08:17:19] <_joe_> !log rebooting mw1262
[08:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:20:48] Operations, DBA, Epic: Eliminate SPOF at the main database infrastructure - https://phabricator.wikimedia.org/T119626#2356819 (jcrespo)
[08:20:50] Operations, DBA, MediaWiki-Database, Performance: Implement GTID replication on MariaDB 10 servers - https://phabricator.wikimedia.org/T133385#2356814 (jcrespo) Open>Resolved a:jcrespo GTID rolled in on all production coredb servers. Resolving now, although it will still be applied to...
[08:23:29] (PS4) Giuseppe Lavagetto: hhvm::debug: fix hhvm-dump-debug on jessie [puppet] - https://gerrit.wikimedia.org/r/292160
[08:26:12] (CR) Giuseppe Lavagetto: [C: 2] hhvm::debug: fix hhvm-dump-debug on jessie [puppet] - https://gerrit.wikimedia.org/r/292160 (owner: Giuseppe Lavagetto)
[08:27:20] ACKNOWLEDGEMENT - Disk space on elastic1012 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 67137 MB (13% inode=99%): Gehel Rebalancing in progress
[08:27:21] ACKNOWLEDGEMENT - Disk space on elastic1013 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 78437 MB (15% inode=99%): Gehel Rebalancing in progress
[08:27:51] !log lowering elasticsearch high watermark on eqiad cluster to rebalance disk space
[08:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:28:27] (PS1) Nikerabbit: Use BotPassword for TNBot [mediawiki-config] - https://gerrit.wikimedia.org/r/292898 (https://phabricator.wikimedia.org/T110766)
[08:31:21] (PS2) Nikerabbit: Use bot password for TNBot [mediawiki-config] - https://gerrit.wikimedia.org/r/292898 (https://phabricator.wikimedia.org/T110766)
[08:33:28] (PS1) Ppchelko: Change-Prop: Enable file transclusions updates. [puppet] - https://gerrit.wikimedia.org/r/292899
[08:33:42] (PS3) Giuseppe Lavagetto: facter: remove references to $::memorytotal [puppet] - https://gerrit.wikimedia.org/r/292896 (https://phabricator.wikimedia.org/T131749)
[08:34:01] (CR) Giuseppe Lavagetto: [C: 2 V: 2] facter: remove references to $::memorytotal [puppet] - https://gerrit.wikimedia.org/r/292896 (https://phabricator.wikimedia.org/T131749) (owner: Giuseppe Lavagetto)
[08:42:48] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Puppet has 1 failures
[08:45:09] !log change-prop deployed 9b04e475
[08:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:49:15] <_joe_> tin is a known problem I am going to fix soon-ish
[08:51:18] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:53:08] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.011 second response time
[08:58:07] RECOVERY - Disk space on ms-be2012 is OK: DISK OK
[09:01:18] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:02:16] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.758 second response time
[09:10:45] Operations, Discovery, Elasticsearch, Discovery-Search-Sprint, Patch-For-Review: Increase time before alert for elasticsearch disk space issues - https://phabricator.wikimedia.org/T136702#2356873 (Gehel)
[09:12:24] (PS1) Giuseppe Lavagetto: mediawiki::cgroup: fix upstart job [puppet] - https://gerrit.wikimedia.org/r/292901
[09:12:54] !log installing dpkg bugfix updates on jessie systems
[09:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:14:29] (CR) Giuseppe Lavagetto: [C: 2] mediawiki::cgroup: fix upstart job [puppet] - https://gerrit.wikimedia.org/r/292901 (owner: Giuseppe Lavagetto)
[09:15:21] Operations, ops-codfw, DBA: es2017 and es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2356875 (jcrespo) Tuesday, whenever you start working and are available (my afternoon)?
[09:17:37] RECOVERY - puppet last run on mw1018 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[09:19:07] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.505 second response time
[09:21:07] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.013 second response time
[09:25:52] (PS1) Giuseppe Lavagetto: hhvm: fix stacktraces script for jessie [puppet] - https://gerrit.wikimedia.org/r/292902
[09:27:48] (PS2) Giuseppe Lavagetto: hhvm: fix stacktraces script for jessie [puppet] - https://gerrit.wikimedia.org/r/292902
[09:29:06] (CR) Giuseppe Lavagetto: [C: 2] hhvm: fix stacktraces script for jessie [puppet] - https://gerrit.wikimedia.org/r/292902 (owner: Giuseppe Lavagetto)
[09:34:56] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 10.820 second response time
[09:38:27] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[09:39:26] PROBLEM - puppet last run on mw2066 is CRITICAL: CRITICAL: puppet fail
[09:40:48] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 6.730 second response time
[09:44:20] Operations, ops-eqiad, fundraising-tech-ops: Rack and setup paylvs1005-8 - https://phabricator.wikimedia.org/T136881#2356960 (mark)
[09:56:41] (PS3) Jforrester: Enable VisualEditor by default for logged-in users on four Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/292274
[09:56:59] (CR) Jforrester: [C: 1] "Ready to go this UTC afternoon." [mediawiki-config] - https://gerrit.wikimedia.org/r/292274 (owner: Jforrester)
[09:58:01] (PS2) Jforrester: Switch Wikivoyages to Single Edit Tab mode for VE Beta Feature [mediawiki-config] - https://gerrit.wikimedia.org/r/292614
[09:58:48] (CR) Jforrester: [C: 1] "Ready to go this UTC afternoon." [mediawiki-config] - https://gerrit.wikimedia.org/r/292614 (owner: Jforrester)
[09:59:01] (CR) Elukey: "LGTM. I have a question about the cassandra common hiera settings and the AQS ones (hieradata/role/common/cassandra.yaml). Are those compl" [puppet] - https://gerrit.wikimedia.org/r/290860 (owner: Eevans)
[10:03:42] Operations, Blocked-on-RelEng, Gitblit-Deprecate: Phase out antimony.wikimedia.org (git.wikimedia.org / gitblit) - https://phabricator.wikimedia.org/T123718#2357015 (Paladox)
[10:03:48] (PS1) Giuseppe Lavagetto: install_server: fix partman recipe for 2-disks appservers [puppet] - https://gerrit.wikimedia.org/r/292904
[10:05:50] (PS2) Giuseppe Lavagetto: install_server: fix partman recipe for 2-disks appservers [puppet] - https://gerrit.wikimedia.org/r/292904
[10:07:44] RECOVERY - puppet last run on mw2066 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:09:22] (CR) Giuseppe Lavagetto: [C: 2] install_server: fix partman recipe for 2-disks appservers [puppet] - https://gerrit.wikimedia.org/r/292904 (owner: Giuseppe Lavagetto)
[10:22:47] (PS1) Giuseppe Lavagetto: install_server: re-tab netboot/mw-raid1.cfg [puppet] - https://gerrit.wikimedia.org/r/292905
[10:23:08] (CR) Giuseppe Lavagetto: [C: 2 V: 2] install_server: re-tab netboot/mw-raid1.cfg [puppet] - https://gerrit.wikimedia.org/r/292905 (owner: Giuseppe Lavagetto)
[10:25:03] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:25:45] !log rebooting kafka100[12] for kernel upgrades (one at the time with de-pool/re-pool actions)
[10:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:26:53] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.049 second response time
[10:27:28] !log elukey@palladium conftool action : set/pooled=no; selector: kafka1001.eqiad.wmnet
[10:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:28:27] (PS2) Ppchelko: Change-Prop: Enable file transclusions updates. [puppet] - https://gerrit.wikimedia.org/r/292899
[10:33:15] !log installing perl updates (bugfixes and CVE-2015-8853)
[10:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:34:27] Operations, Gerrit: Gerrit replication to furud.codfw.wmnet fails with: reject HostKey: furud.codfw.wmnet - https://phabricator.wikimedia.org/T136822#2357048 (hashar) What I suspect is the ssh host key on furud.codfw.wmnet changed or it got removed from ytterbium:/var/lib/gerrit2/.ssh/known_hosts causin...
[10:36:59] (PS1) Jcrespo: [WIP] Puppetize netboot installer creation [puppet] - https://gerrit.wikimedia.org/r/292906
[10:44:59] !log elukey@palladium conftool action : set/pooled=yes; selector: kafka1001.eqiad.wmnet
[10:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:46:08] !log re-added kafka1001 to eventbus.svc.eqiad.wmflabs without rebooting since some concerns were raised from the Services team. Will have a discussion with them before proceeding.
[10:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:49:54] (PS2) Jcrespo: [WIP] Puppetize netboot installer creation [puppet] - https://gerrit.wikimedia.org/r/292906
[11:21:20] (PS2) Ema: update-ocsp-all: write output to logfile [puppet] - https://gerrit.wikimedia.org/r/291752 (https://phabricator.wikimedia.org/T132835)
[11:23:22] Operations, Mail, OTRS, TCB-Team, WMDE-Fundraising-Software: add WMDE mx's to SpamAssassin trusted hosts to fix SPF softfails - https://phabricator.wikimedia.org/T83499#2357133 (Danny_B)
[11:26:59] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: Puppet has 1 failures
[11:50:07] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 2 failures
[11:53:06] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:53:08] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:54:57] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.008 second response time
[12:01:30] Operations: Integrate jessie 8.5 point release - https://phabricator.wikimedia.org/T137087#2357199 (MoritzMuehlenhoff) Open>Resolved The following updates from jessie 8.5 have been deployed: clamav dpkg hivex libdatetime-timezone-perl libksba nmap perl postgresql-9.1 postgresql-9.4 quota xapian-core...
[12:03:36] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 10.826 second response time
[12:03:47] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[12:05:37] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5635239 keys - replication_delay is 0
[12:17:36] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 16.597 second response time
[12:26:48] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.002 second response time
[12:27:57] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[12:30:57] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.016 second response time
[12:33:09] (CR) BBlack: Extend the %{format}t timestamp formatter with (begin|end): prefixes (4 comments) [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/292172 (https://phabricator.wikimedia.org/T136314) (owner: Elukey)
[12:34:29] !log restarted Jenkins, deadlock in IRC plugin
[12:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:34:54] (CR) BBlack: [C: 1] update-ocsp-all: write output to logfile [puppet] - https://gerrit.wikimedia.org/r/291752 (https://phabricator.wikimedia.org/T132835) (owner: Ema)
[12:36:47] (CR) BBlack: [C: 1] nginx: remove jessie conditional for mount [puppet/nginx] - https://gerrit.wikimedia.org/r/291278 (owner: Dzahn)
[12:37:45] (CR) BBlack: [C: 1] redirect moon.wikimedia.org to meta page [puppet] - https://gerrit.wikimedia.org/r/292772 (https://phabricator.wikimedia.org/T136557) (owner: Dzahn)
[12:37:55] (CR) BBlack: [C: 1] add moon.wikimedia.org, point to cluster [dns] - https://gerrit.wikimedia.org/r/292771 (https://phabricator.wikimedia.org/T136557) (owner: Dzahn)
[12:38:24] (CR) BBlack: [C: 1] ores: Add varnish backend in the misc cluster [puppet] - https://gerrit.wikimedia.org/r/292543 (https://phabricator.wikimedia.org/T124203) (owner: Alexandros Kosiaris)
[12:38:51] (PS4) BBlack: varnish: move errorpage.html from misc to module [puppet] - https://gerrit.wikimedia.org/r/290876 (owner: Dzahn)
[12:38:57] (CR) BBlack: [C: 1] varnish: move errorpage.html from misc to module [puppet] - https://gerrit.wikimedia.org/r/290876 (owner: Dzahn)
[12:43:52] Operations, Tracking: Make services manageable by systemd (tracking) - https://phabricator.wikimedia.org/T97402#2357300 (faidon)
[12:44:56] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.009 second response time
[12:46:57] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.027 second response time
[12:49:58] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 2 failures
[12:51:06] PROBLEM - puppet last run on mw2212 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:52:57] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.586 second response time
[12:54:57] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 3.333 second response time
[13:16:54] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:16:55] RECOVERY - puppet last run on mw2212 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[13:18:54] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.009 second response time
[13:20:45] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.026 second response time
[13:27:39] Operations, Traffic: californium and gallium ferm rules should disallow public access to port 80 - https://phabricator.wikimedia.org/T137106#2357419 (BBlack)
[13:34:40] (CR) Elukey: Extend the %{format}t timestamp formatter with (begin|end): prefixes (4 comments) [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/292172 (https://phabricator.wikimedia.org/T136314) (owner: Elukey)
[13:35:24] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.003 second response time
[13:36:39] (CR) Ema: [C: 2 V: 2] update-ocsp-all: write output to logfile [puppet] - https://gerrit.wikimedia.org/r/291752 (https://phabricator.wikimedia.org/T132835) (owner: Ema)
[13:37:15] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.019 second response time
[13:39:24] PROBLEM - puppet last run on rdb2001 is CRITICAL: CRITICAL: puppet fail
[13:47:06] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.021 second response time
[13:51:15] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 9.134 second response time
[14:00:25] !log dropping database blog from m1
[14:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:02:38] (PS1) Gehel: Activate more logs on postgresql for maps. [puppet] - https://gerrit.wikimedia.org/r/292925
[14:04:25] (CR) Gehel: [C: 2] Activate more logs on postgresql for maps. [puppet] - https://gerrit.wikimedia.org/r/292925 (owner: Gehel)
[14:06:37] (CR) Mobrovac: "LGTM, but one thing to be aware of is that cassandra::instance uses 'admin,team_services' as the contact group(s), but I guess that's OK s" [puppet] - https://gerrit.wikimedia.org/r/292568 (https://phabricator.wikimedia.org/T135145) (owner: Elukey)
[14:07:25] RECOVERY - puppet last run on rdb2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:10:34] !log dropping old bugzilla databases from m1
[14:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:19:35] (CR) Yurik: "is it possible to use this conf only for one server (maps2001) ? I feel its not needed on all the other ones" [puppet] - https://gerrit.wikimedia.org/r/292925 (owner: Gehel)
[14:20:04] (PS1) BBlack: Remove TLS bits from internal sites behind cache_misc [puppet] - https://gerrit.wikimedia.org/r/292928 (https://phabricator.wikimedia.org/T132685)
[14:20:06] (PS1) BBlack: ssl_ciphersuite: standardize STS preload [puppet] - https://gerrit.wikimedia.org/r/292929 (https://phabricator.wikimedia.org/T132685)
[14:20:08] (PS1) BBlack: Set includeSub/preload for wikimedia.org in VCL [puppet] - https://gerrit.wikimedia.org/r/292930 (https://phabricator.wikimedia.org/T132685)
[14:20:21] jouncebot: next
[14:20:21] In 0 hour(s) and 39 minute(s): Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160606T1500)
[14:24:27] (CR) jenkins-bot: [V: -1] Remove TLS bits from internal sites behind cache_misc [puppet] - https://gerrit.wikimedia.org/r/292928 (https://phabricator.wikimedia.org/T132685) (owner: BBlack)
[14:25:06] PROBLEM - Outgoing network saturation on labstore1001 is CRITICAL: CRITICAL: 17.24% of data above the critical threshold [106250000.0]
[14:25:35] (CR) jenkins-bot: [V: -1] ssl_ciphersuite: standardize STS preload [puppet] - https://gerrit.wikimedia.org/r/292929 (https://phabricator.wikimedia.org/T132685) (owner: BBlack)
[14:25:45] we should really re-run the whole puppet repo before upgrading checks
[14:25:56] having them pop in in work that didn't create them is !#$^@#$%^$
[14:26:45] oh nevermind, this isn't that, this is the "arrow-alignment checks are awful because they cause excess whitespace diff-noise" problem instead
[14:27:36] (CR) jenkins-bot: [V: -1] Set includeSub/preload for wikimedia.org in VCL [puppet] - https://gerrit.wikimedia.org/r/292930 (https://phabricator.wikimedia.org/T132685) (owner: BBlack)
[14:28:46] (PS2) BBlack: Set includeSub/preload for wikimedia.org in VCL [puppet] - https://gerrit.wikimedia.org/r/292930 (https://phabricator.wikimedia.org/T132685)
[14:28:48] (PS2) BBlack: ssl_ciphersuite: standardize STS preload [puppet] - https://gerrit.wikimedia.org/r/292929 (https://phabricator.wikimedia.org/T132685)
[14:28:50] (PS2) BBlack: Remove TLS bits from internal sites behind cache_misc [puppet] - https://gerrit.wikimedia.org/r/292928 (https://phabricator.wikimedia.org/T132685)
[14:34:22] (PS3) BBlack: Set includeSub/preload for wikimedia.org in VCL [puppet] - https://gerrit.wikimedia.org/r/292930 (https://phabricator.wikimedia.org/T132685)
[14:34:24] (PS3) BBlack: ssl_ciphersuite: standardize STS preload [puppet] - https://gerrit.wikimedia.org/r/292929 (https://phabricator.wikimedia.org/T132685)
[14:34:26] (PS3) BBlack: Remove TLS bits from internal sites behind cache_misc [puppet] - https://gerrit.wikimedia.org/r/292928 (https://phabricator.wikimedia.org/T132685)
[14:34:46] (PS1) Gehel: Activate more logs on postgresql for maps. [puppet] - https://gerrit.wikimedia.org/r/292931
[14:40:35] (PS2) Gehel: Activate more logs on postgresql for maps. [puppet] - https://gerrit.wikimedia.org/r/292931
[14:42:35] (CR) jenkins-bot: [V: -1] Activate more logs on postgresql for maps. [puppet] - https://gerrit.wikimedia.org/r/292931 (owner: Gehel)
[14:43:28] (CR) BBlack: [C: 2] Remove TLS bits from internal sites behind cache_misc [puppet] - https://gerrit.wikimedia.org/r/292928 (https://phabricator.wikimedia.org/T132685) (owner: BBlack)
[14:43:41] (CR) Gehel: "recheck" [puppet] - https://gerrit.wikimedia.org/r/292931 (owner: Gehel)
[14:51:35] RECOVERY - Outgoing network saturation on labstore1001 is OK: OK: Less than 10.00% above the threshold [93750000.0]
[14:52:16] (CR) Anomie: "Is the plan that I790d39c2 will be backported to all deployed branches at the same time this is deployed?" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/292758 (owner: Gergő Tisza)
[14:54:09] (PS4) BBlack: Set includeSub/preload for wikimedia.org in VCL [puppet] - https://gerrit.wikimedia.org/r/292930 (https://phabricator.wikimedia.org/T132685)
[14:54:11] (PS4) BBlack: ssl_ciphersuite: standardize STS preload [puppet] - https://gerrit.wikimedia.org/r/292929 (https://phabricator.wikimedia.org/T132685)
[14:54:13] (PS1) BBlack: fix ferm port 80 for californium/gallium [puppet] - https://gerrit.wikimedia.org/r/292935 (https://phabricator.wikimedia.org/T137106)
[14:54:33] (PS3) Gehel: Activate more logs on postgresql for maps.
[puppet] - 10https://gerrit.wikimedia.org/r/292931 [14:56:40] (03PS2) 10BBlack: fix ferm port 80 for californium/gallium [puppet] - 10https://gerrit.wikimedia.org/r/292935 (https://phabricator.wikimedia.org/T137106) [14:56:42] (03PS5) 10BBlack: Set includeSub/preload for wikimedia.org in VCL [puppet] - 10https://gerrit.wikimedia.org/r/292930 (https://phabricator.wikimedia.org/T132685) [14:56:44] (03PS5) 10BBlack: ssl_ciphersuite: standardize STS preload [puppet] - 10https://gerrit.wikimedia.org/r/292929 (https://phabricator.wikimedia.org/T132685) [14:57:54] (03PS1) 10Ema: Don't install apt-show-versions [puppet] - 10https://gerrit.wikimedia.org/r/292936 (https://phabricator.wikimedia.org/T132324) [14:58:11] (03PS2) 10Gergő Tisza: Apply AbuseFilter configuration syntax change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292758 [14:58:41] * James_F waves in preparation for wm-bot. [14:58:48] No, wait, jouncebot. [14:58:59] (03PS1) 10Mobrovac: Math: Set wgMathFullRestbaseURL to point to wikimedia.org in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292937 (https://phabricator.wikimedia.org/T136205) [14:59:40] (03PS1) 10Gehel: Revert "Send wmf.4 search and ttmserver traffic to codfw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292938 [15:00:05] anomie ostriches thcipriani marktraceur: Dear anthropoid, the time has come. Please deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160606T1500). [15:00:05] James_F kart_ yurik nikerabbit: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [15:00:07] 06Operations, 10Gerrit: Gerrit replication to furud.codfw.wmnet fails with: reject HostKey: furud.codfw.wmnet - https://phabricator.wikimedia.org/T136822#2357545 (10demon) It shoud've been disabled again bleh, furud isn't going to be a thing anymore. 
[15:00:07] (03CR) 10Gergő Tisza: "> Is the plan that I790d39c2 will be backported to all deployed branches at the same time this is deployed?" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292758 (owner: 10Gergő Tisza) [15:00:11] yep [15:00:19] That's the one. [15:00:41] (03PS9) 10GWicke: Logstash_checker script for canary deploys [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) [15:00:56] (03CR) 10Anomie: "Although... that plan wouldn't work all that well for Labs in the time between merging the AbuseFilter patch to master and the deploy to p" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292758 (owner: 10Gergő Tisza) [15:01:25] around [15:01:27] I can SWAT today. [15:01:37] (03CR) 10BBlack: [C: 032 V: 032] fix ferm port 80 for californium/gallium [puppet] - 10https://gerrit.wikimedia.org/r/292935 (https://phabricator.wikimedia.org/T137106) (owner: 10BBlack) [15:01:44] (03CR) 10Gehel: Revert "Send wmf.4 search and ttmserver traffic to codfw" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292938 (owner: 10Gehel) [15:01:52] plop [15:01:53] jenkins seems to be very slow? 
[15:02:29] (03PS1) 10Chad: Gerrit: kill replication to furud [puppet] - 10https://gerrit.wikimedia.org/r/292940 [15:02:40] 06Operations, 10Gerrit: Gerrit replication to furud.codfw.wmnet fails with: reject HostKey: furud.codfw.wmnet - https://phabricator.wikimedia.org/T136822#2357557 (10demon) https://gerrit.wikimedia.org/r/#/c/292940/ [15:03:22] (03CR) 10jenkins-bot: [V: 04-1] Set includeSub/preload for wikimedia.org in VCL [puppet] - 10https://gerrit.wikimedia.org/r/292930 (https://phabricator.wikimedia.org/T132685) (owner: 10BBlack) [15:03:25] (03CR) 10Anomie: "Maybe at the bottom" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292758 (owner: 10Gergő Tisza) [15:03:31] Nikerabbit: we still need to use canary servers to test, https://gerrit.wikimedia.org/r/289652 [15:03:38] (03PS2) 10Gehel: Revert "Send wmf.4 search and ttmserver traffic to codfw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292938 [15:03:52] (03CR) 10jenkins-bot: [V: 04-1] ssl_ciphersuite: standardize STS preload [puppet] - 10https://gerrit.wikimedia.org/r/292929 (https://phabricator.wikimedia.org/T132685) (owner: 10BBlack) [15:04:03] !log dropping old outreach databases on m1 [15:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:04:20] thcipriani: Please see, https://gerrit.wikimedia.org/r/#q,289652,n,z before merge, it require test on canary servers as Krinkle mentioned there. [15:04:35] thcipriani: ie before deploy :) [15:04:42] (03CR) 10DCausse: [C: 031] Revert "Send wmf.4 search and ttmserver traffic to codfw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292938 (owner: 10Gehel) [15:04:43] kart_: ack, thanks [15:04:44] (03CR) 10Anomie: "That assumes we don't have Iefd8d346 without I790d39c2 at any relevant point, but that seems like a safe assumption." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/292758 (owner: 10Gergő Tisza) [15:05:03] PROBLEM - Apache HTTP on mw1262 is CRITICAL: Connection refused [15:05:16] 06Operations, 10Traffic, 13Patch-For-Review: californium and gallium ferm rules should disallow public access to port 80 - https://phabricator.wikimedia.org/T137106#2357559 (10BBlack) 05Open>03Resolved a:03BBlack [15:05:34] PROBLEM - mediawiki-installation DSH group on mw1262 is CRITICAL: Host mw1262 is not in mediawiki-installation dsh group [15:05:54] (03CR) 10jenkins-bot: [V: 04-1] Logstash_checker script for canary deploys [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: 10GWicke) [15:05:54] PROBLEM - nutcracker port on mw1262 is CRITICAL: Connection refused by host [15:06:14] PROBLEM - nutcracker process on mw1262 is CRITICAL: Connection refused by host [15:06:34] PROBLEM - puppet last run on mw1262 is CRITICAL: Connection refused by host [15:06:36] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003, stat1002 and bast1001 for joewalsh - https://phabricator.wikimedia.org/T137110#2357563 (10JoeWalsh) [15:06:41] (03CR) 10BBlack: [C: 031] Math: Set wgMathFullRestbaseURL to point to wikimedia.org in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292937 (https://phabricator.wikimedia.org/T136205) (owner: 10Mobrovac) [15:06:54] PROBLEM - salt-minion processes on mw1262 is CRITICAL: Connection refused by host [15:06:55] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292274 (owner: 10Jforrester) [15:06:55] RECOVERY - Apache HTTP on mw1262 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.004 second response time [15:07:07] (03CR) 10jenkins-bot: [V: 04-1] Gerrit: kill replication to furud [puppet] - 10https://gerrit.wikimedia.org/r/292940 (owner: 10Chad) [15:07:24] PROBLEM - Check size of conntrack table on mw1262 is CRITICAL: Connection refused by host [15:07:43] PROBLEM 
- DPKG on mw1262 is CRITICAL: Connection refused by host [15:07:54] PROBLEM - Disk space on mw1262 is CRITICAL: Connection refused by host [15:08:14] PROBLEM - MD RAID on mw1262 is CRITICAL: Connection refused by host [15:08:40] hmm, zuul seems a little slow this morning... [15:08:44] PROBLEM - puppet last run on mw2213 is CRITICAL: CRITICAL: puppet fail [15:09:04] PROBLEM - configured eth on mw1262 is CRITICAL: Connection refused by host [15:09:14] (03Merged) 10jenkins-bot: Enable VisualEditor by default for logged-in users on four Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292274 (owner: 10Jforrester) [15:09:23] PROBLEM - dhclient process on mw1262 is CRITICAL: Connection refused by host [15:10:15] hmm, bast4001 hostkey change? [15:10:37] thcipriani: it was reinstalled, it's normal [15:10:44] paravoid: ack, thank you. [15:10:48] fetch known hosts from another known-to-you bastion [15:10:50] e.g. bast1001 [15:10:58] anyone knows anything about bot passwords? [15:11:14] what is the "bot name" thing in Special:BotPasswords [15:11:29] is that the user name, or any string? [15:11:48] tgr ^ [15:11:52] (03PS2) 10Chad: Gerrit: kill replication to furud [puppet] - 10https://gerrit.wikimedia.org/r/292940 [15:12:29] yurik: it's suffixed to the actual username [15:12:42] so the new login name becomes username@botname [15:12:45] yurik:up to you, the actual username will be @ [15:13:03] (03CR) 10jenkins-bot: [V: 04-1] Gerrit: kill replication to furud [puppet] - 10https://gerrit.wikimedia.org/r/292940 (owner: 10Chad) [15:13:23] 06Operations, 10MediaWiki-General-or-Unknown, 07HHVM, 13Patch-For-Review: Make all role::mediawiki::* classes compatible with debian jessie - https://phabricator.wikimedia.org/T131749#2357594 (10Joe) MediaWiki was dropped from the latest debian stable point release, so mediawiki-math-textvc is not availabl... [15:13:36] ?? so confused. 
So if the account name is yurik, and i set the bot name to zero, what creds should i use to login? [15:13:48] yurik@zero? [15:13:50] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292614 (owner: 10Jforrester) [15:13:52] yurik@zero and the generated password [15:13:58] tgr, gotcha, thx [15:14:05] !log thcipriani@tin Synchronized dblists/visualeditor-default.dblist: SWAT: [[gerrit:292274|Enable VisualEditor by default for logged-in users on four Wikipedias]] PART I (duration: 00m 29s) [15:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:14:29] (03Merged) 10jenkins-bot: Switch Wikivoyages to Single Edit Tab mode for VE Beta Feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292614 (owner: 10Jforrester) [15:14:43] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:292274|Enable VisualEditor by default for logged-in users on four Wikipedias]] PART II (duration: 00m 30s) [15:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:14:48] ^ James_F check please [15:15:02] (03CR) 10BBlack: Extend the %{format}t timestamp formatter with (begin|end): prefixes (032 comments) [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/292172 (https://phabricator.wikimedia.org/T136314) (owner: 10Elukey) [15:15:17] Cool. [15:15:29] tgr, if the login should only be allowed from our own servers, how should i modify IPAddresses [15:15:48] (03CR) 10Physikerwelt: [C: 031] "That's a good idea. It will improve the effectiveness of the browser image cache, for people that use different language versions of the s" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292937 (https://phabricator.wikimedia.org/T136205) (owner: 10Mobrovac) [15:16:00] thcipriani: First one is fine. [15:16:10] yurik: maybe 10.0.0.0/8 ? [15:16:12] Nikerabbit, thx for your help [15:16:21] thcipriani: Second one not having an effect yet? 
[15:16:25] * James_F refreshes. [15:16:54] PROBLEM - Apache HTTP on mw1262 is CRITICAL: Connection timed out [15:17:05] tgr, bd808, i presume these values can be edited later, right? [15:17:13] James_F: wmf-config? [15:17:15] Krinkle: https://gerrit.wikimedia.org/r/#/c/292614/2/wmf-config/InitialiseSettings.php [15:17:23] Krinkle: It's not working and I can't quite tell why. [15:17:25] yurik: the class dealing with that is MWRestrictions, which uses isIPInRange [15:17:41] yurik: I don't actually remember. [15:17:43] James_F: Is it listed as a tag and loaded as such near extract($globals) ? [15:17:49] if you are asking for the specific IP range for WMF, I don't know what that is [15:18:20] 'wikivoyage.dblist' exists, though a dblist is not needed for it to work. [15:18:32] suffixes also work, but it requires being added to a whitelist [15:18:34] yurik: you can update it, yes [15:18:36] (03CR) 10BBlack: Extend the %{format}t timestamp formatter with (begin|end): prefixes (031 comment) [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/292172 (https://phabricator.wikimedia.org/T136314) (owner: 10Elukey) [15:18:49] https://github.com/wikimedia/operations-mediawiki-config/blob/4e0806bd0c9332a3282101598e87832b1de1ac7f/wmf-config/wgConf.php#L9-L20 [15:18:52] It's listed, so it should work [15:19:22] Krinkle: Yeah… [15:19:30] James_F: wait, the second one meaning: single tab ve for wikivoyage? [15:19:33] James_F: SET works for me now on en,wikiv [15:19:52] I haven't sync'd that one yet if that's what you're looking at. [15:20:16] thcipriani: Ooooh. OK, that'd explain. ;-) [15:20:19] :) [15:20:22] doing now [15:20:25] Thanks. [15:20:28] Krinkle: Are you sure? [15:20:47] thcipriani, i need to update private settings, after which it can be deployed. let me know when you want me to do it [15:21:04] James_F: nvm, indeed. not. [15:21:12] James_F: Remember it takes ~5min for startup JS to match [15:21:18] Yeah.
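Editor's sketch of the two bot-password points discussed above: the login name is the account name with the bot name appended after an "@", and a password can be restricted to address ranges such as 10.0.0.0/8. This uses Python's standard ipaddress module and is not MediaWiki's MWRestrictions / isIPInRange code; the function names and the sample addresses are assumptions made for illustration.

```python
import ipaddress


def bot_login_name(username, botname):
    """Build the login name used with a bot password: username@botname."""
    return f"{username}@{botname}"


def ip_allowed(ip, allowed_ranges):
    """Return True if ip falls inside any of the given CIDR ranges."""
    addr = ipaddress.ip_address(ip)
    for cidr in allowed_ranges:
        net = ipaddress.ip_network(cidr, strict=False)
        # Only compare addresses and networks of the same IP version.
        if addr.version == net.version and addr in net:
            return True
    return False


if __name__ == "__main__":
    print(bot_login_name("yurik", "zero"))
    # 10.64.0.1 and 192.0.2.1 are example addresses, not real hosts.
    print(ip_allowed("10.64.0.1", ["10.0.0.0/8"]))
    print(ip_allowed("192.0.2.1", ["10.0.0.0/8"]))
```

A restriction list like ["10.0.0.0/8"] would admit any private 10.x address, which is why the log's "maybe 10.0.0.0/8 ?" is only a guess at a suitable range.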
[15:21:33] PROBLEM - Disk space on dataset1001 is CRITICAL: DISK CRITICAL - free space: /var/lib/nginx 0 MB (0% inode=99%) [15:21:36] But I see ve-edit/Edit source in html source as well [15:21:39] So yeah, it doesn't work [15:21:49] James_F: Best is to check eval.php to check the value of the config var at run time [15:21:49] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT [[gerrit:292614|Switch Wikivoyages to Single Edit Tab mode for VE Beta Feature]] (duration: 00m 24s) [15:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:21:54] ^ James_F deployed now [15:21:55] Well.. [15:22:13] James_F: Works now :) [15:22:22]
Edit [15:22:22] thcipriani: Yup. :-) [15:22:29] (there's a leading space btw) [15:22:35] Krinkle: Yeah, we should fix that. [15:22:38] Probably something still using strings for 'class' instead of array [15:22:44] and using a space prepend [15:22:47] (03PS4) 10Gehel: Activate more logs on postgresql for maps. [puppet] - 10https://gerrit.wikimedia.org/r/292931 [15:23:33] RECOVERY - Disk space on dataset1001 is OK: DISK OK [15:24:07] (03CR) 10jenkins-bot: [V: 04-1] Activate more logs on postgresql for maps. [puppet] - 10https://gerrit.wikimedia.org/r/292931 (owner: 10Gehel) [15:24:11] James_F: look right to you: https://gerrit.wikimedia.org/r/#/c/292944/ ? [15:25:36] (03PS1) 10BBlack: fix arrow alignment .... [puppet] - 10https://gerrit.wikimedia.org/r/292946 [15:25:56] (03CR) 10BBlack: [C: 032 V: 032] fix arrow alignment .... [puppet] - 10https://gerrit.wikimedia.org/r/292946 (owner: 10BBlack) [15:26:38] gehel: the jenkins failure is from arrow alignment issue I created in an earlier commit, where I gave up waiting on jenkins [15:26:41] fixed now [15:26:46] sorry! [15:26:57] (03PS3) 10Thcipriani: Use wfLoadExtension for LocalisationUpdate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287316 (owner: 10Nikerabbit) [15:26:57] bblack: yep, I saw that. Thanks for the fix!
[15:27:21] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287316 (owner: 10Nikerabbit) [15:28:04] (03Merged) 10jenkins-bot: Use wfLoadExtension for LocalisationUpdate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287316 (owner: 10Nikerabbit) [15:28:42] (03PS6) 10BBlack: ssl_ciphersuite: standardize STS preload [puppet] - 10https://gerrit.wikimedia.org/r/292929 (https://phabricator.wikimedia.org/T132685) [15:28:51] (03PS6) 10BBlack: Set includeSub/preload for wikimedia.org in VCL [puppet] - 10https://gerrit.wikimedia.org/r/292930 (https://phabricator.wikimedia.org/T132685) [15:29:03] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 345, down: 1, shutdown: 0 [15:29:05] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003, stat1002 and bast1001 for joewalsh - https://phabricator.wikimedia.org/T137110#2357563 (10Krenair) From the shell request instructions, I think this part is missing: > The project being worked on with a full and detailed reason for access and... [15:30:15] bblack, around? we need to create you a new password for the ip updater script [15:30:55] 06Operations, 10Traffic: Scripts depending on varnishlog.py maxing out CPU usage on cache_misc - https://phabricator.wikimedia.org/T137114#2357668 (10ema) [15:30:57] (03PS5) 10Gehel: Activate more logs on postgresql for maps. [puppet] - 10https://gerrit.wikimedia.org/r/292931 [15:32:34] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:287316|Use wfLoadExtension for LocalisationUpdate]] (duration: 00m 27s) [15:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:32:55] yurik: ok [15:33:11] ^ kart_ Nikerabbit wmfloadextension update sync'd [15:33:35] (03PS5) 10Thcipriani: ULS: Stop using /static/current [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289652 (https://phabricator.wikimedia.org/T135806) (owner: 10Nikerabbit) [15:33:40] thcipriani: thanks. 
Special:Version is OK. [15:34:19] bblack, for the script that you have, can you use its credentials to login into zerowiki, go to Special:BotPasswords, give it some name, like varnishupdater or whatever, generate password (you can limit the IPs), and use them? [15:34:31] 06Operations, 10RESTBase, 06Services: RESTBase shutting down spontaneously - https://phabricator.wikimedia.org/T136957#2357700 (10Danny_B) [15:34:42] 06Operations, 10Traffic: Scripts depending on varnishlog.py maxing out CPU usage on cache_misc - https://phabricator.wikimedia.org/T137114#2357701 (10ema) p:05Triage>03High [15:35:02] thcipriani: looks fine. [15:35:08] does the IP limiting thing take whole networks? [15:35:13] kart_: ack, thanks for checking [15:35:20] anyways, I'll look around... [15:35:44] I guess just set a flag and it will keep working with existing auth isn't really true :P [15:36:41] RECOVERY - puppet last run on mw2213 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:36:56] thcipriani: is it too late to put a patch up for SWAT? [15:37:07] (03CR) 10Gehel: [C: 032] Activate more logs on postgresql for maps. [puppet] - 10https://gerrit.wikimedia.org/r/292931 (owner: 10Gehel) [15:37:12] !log thcipriani@tin Synchronized php-1.28.0-wmf.4/extensions/VisualEditor/modules/ve-mw/init/targets/ve.init.mw.MobileArticleTarget.js: SWAT: [[gerrit:292787|Fix config of mobile surfaces]] (duration: 00m 24s) [15:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:37:17] ^ James_F check please [15:37:54] thcipriani: Yup, fixed. Thanks! [15:38:01] James_F: thanks for checking [15:38:35] mobrovac: I'm pretty full today, but I can try to get to it at the end of SWAT. If it's a config change it should be possible. Core changes might be a little...tight for time. 
[15:38:57] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289652 (https://phabricator.wikimedia.org/T135806) (owner: 10Nikerabbit) [15:39:40] (03Merged) 10jenkins-bot: ULS: Stop using /static/current [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289652 (https://phabricator.wikimedia.org/T135806) (owner: 10Nikerabbit) [15:39:48] thcipriani: kk, it's a config change - https://gerrit.wikimedia.org/r/#/c/292937/, added you as a reviewer, lemme know if there's time [15:39:59] mobrovac: will do, thanks [15:40:03] thnx! [15:40:42] kart_: Krinkle https://gerrit.wikimedia.org/r/#/c/289652/ is on mw1017 [15:40:51] PROBLEM - NTP on mw1262 is CRITICAL: NTP CRITICAL: No response from NTP server [15:40:51] thcipriani: ok. Testing. [15:41:21] I see cache-control:public, s-maxage=31536000, max-age=31536000 [15:41:52] which looks correct, except I get that regardless of whether the hash is correct or not [15:42:02] confirmed. font is loaded from https://www.mediawiki.org/w/extensions/UniversalLanguageSelector/ and with good cache headers [15:42:21] PROBLEM - Outgoing network saturation on labstore1001 is CRITICAL: CRITICAL: 27.59% of data above the critical threshold [106250000.0] [15:42:42] okie doke, thank you. Will roll-out everywhere. [15:42:50] curl -I 'https://www.mediawiki.org/w/extensions/UniversalLanguageSelector/data/fontrepo/fonts/ComicNeue/ComicNeue-Regular.woff2?44c5e' has good headers [15:42:59] if you change one of the hash characters you get short cache headers [15:43:05] Nikerabbit: Krinkle thanks! [15:43:27] meanwhile, Aw shoot, Xhgui hit an error [15:43:32] Krinkle: ^ [15:43:34] Nikerabbit: But if you use a string of different length, it'll consider it garbage and cache it defensively as long. [15:43:59] Krinkle: good to know that gotcha [15:44:05] Krinkle: can confirm it works as described [15:44:24] kart_: just checked here before pulling the final trigger, should I hold? [15:44:33] thcipriani: go ahead.
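Editor's sketch of the caching behaviour described above for versioned static URLs (a matching hash gets the long cache lifetime, a same-length mismatch gets a short one, and a query string of a different length is treated as garbage and cached long). This is a model of what the participants report, not the production code: the MD5 prefix, the 5-character length (taken from the "?44c5e" example) and the short lifetime value are assumptions; only the 31536000-second lifetime comes from the log.

```python
import hashlib

LONG_MAX_AGE = 31536000  # one year, as in the cache-control header quoted in the log
SHORT_MAX_AGE = 60       # hypothetical short lifetime; the real value is not in the log
HASH_LEN = 5             # assumption, based on the "?44c5e" example URL


def content_hash(file_bytes):
    """Hypothetical version hash: a short hex prefix of the file's MD5."""
    return hashlib.md5(file_bytes).hexdigest()[:HASH_LEN]


def cache_max_age(file_bytes, query_hash):
    """Pick a max-age for a static file requested as file?query_hash."""
    if len(query_hash) != HASH_LEN:
        # Wrong length: treated as garbage rather than a version hash, cached long.
        return LONG_MAX_AGE
    if query_hash == content_hash(file_bytes):
        # The URL names exactly this version of the file, safe to cache long.
        return LONG_MAX_AGE
    # Plausible hash for a different version: keep the cache short.
    return SHORT_MAX_AGE


if __name__ == "__main__":
    font = b"example font bytes"
    good = content_hash(font)
    print(cache_max_age(font, good))
    print(cache_max_age(font, "zzzzz"))
    print(cache_max_age(font, "abc"))
```

The short lifetime on a mismatch limits how long a stale or mistaken URL can pin the wrong content in caches.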
[15:44:36] kk [15:45:11] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:289652|ULS: Stop using /static/current]] (duration: 00m 24s) [15:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:45:18] ^ kart_ Krinkle Nikerabbit sync'd [15:45:38] thanks thcipriani, Krinkle & kart_ [15:45:47] yurik: you can make your change now [15:45:56] on it [15:45:59] yurik: Exception: Bad response code 401 from API request for zeroportal [15:46:01] kart_: what error? [15:46:05] (after updating to the new creds) [15:46:21] tgr, ^ [15:46:53] bblack, did you use the new user name as given by the bot page? (with the @ symbol)? [15:47:12] yurik: yes, and the login itself works, it's the fetch of JSON data afterwards that fails with 401 [15:47:21] lovely [15:47:36] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003, stat1002 and bast1001 for joewalsh - https://phabricator.wikimedia.org/T137110#2357752 (10dr0ptp4kt) @Krenair, thanks for noting. This will be for analysis of prod web request logs (Hive/Hadoop) and event logging tables (MySQL/MariaDB). I thi... [15:47:57] yurik: does it require non-basic rights? [15:48:36] yurik: are you using the API to get the data, or just a plain pageview? [15:48:42] bblack, it requires 'zero-script' right [15:49:02] that's not in the checklist for the botpasswords special page... [15:49:10] Nikerabbit: anything need to happen pre https://gerrit.wikimedia.org/r/#/c/292898/2 deploy? 
[15:49:14] (03CR) 10Elukey: Extend the %{format}t timestamp formatter with (begin|end): prefixes (031 comment) [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/292172 (https://phabricator.wikimedia.org/T136314) (owner: 10Elukey) [15:49:17] tgr, api to get the data [15:49:46] thcipriani: PrivateSettings.php needs to be updated, but I don't know if I can edit that file myself [15:49:47] yurik: in that case we need a config patch to add zero-script to $wgGrantPermissions on zerowiki, then update the password with it [15:50:31] I can do the latter part, I still have the page open [15:50:52] the second one might not be needed if it's to an existing grant group, I don't remember if grants or individual rights are stored [15:50:58] Nikerabbit: you can if you've access to tin [15:51:18] kart_: okay [15:52:25] tgr, i am guessing that my change that thcipriani is waiting for should happen after your config change [15:52:46] (adding zerowiki password for the banner access) [15:53:05] yurik: what grant group would you suggest for zero-script? basic? editpage? [15:53:32] 06Operations, 10hardware-requests: Replace/refresh carbon - https://phabricator.wikimedia.org/T137117#2357766 (10faidon) [15:53:40] see DefaultSettings.php line 5630 for the existing ones [15:53:40] RECOVERY - Apache HTTP on mw1262 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.005 second response time [15:53:59] tgr, sysop? [15:54:14] hmm, I'm not sure that permissions for privatesettings.php are correct on tin [15:55:07] zero-script has to be added to an existing grant group? none of them really sound appropriate [15:55:09] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003, stat1002 and bast1001 for joewalsh - https://phabricator.wikimedia.org/T137110#2357785 (10Krenair) >>! In T137110#2357752, @dr0ptp4kt wrote: > @Krenair, thanks for noting. This will be for analysis of prod web request logs (Hive/Hadoop) and ev... [15:55:18] it's a custom thing for a custom thing...
[15:55:19] grant groups are a bit different from user groups, but I'll add it to editinterface then, that's the most sysop-ish [15:55:46] tgr, https://noc.wikimedia.org/conf/highlight.php?file=CommonSettings.php -- look for zero-script [15:56:27] tgr, also, this right only exists on zerowiki [15:56:33] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003, stat1002 and bast1001 for joewalsh - https://phabricator.wikimedia.org/T137110#2357563 (10Ottomata) This request is for `analytics-privatedata-users` and `researchers`. [15:56:40] (03PS2) 10Thcipriani: Math: Set wgMathFullRestbaseURL to point to wikimedia.org in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292937 (https://phabricator.wikimedia.org/T136205) (owner: 10Mobrovac) [15:56:52] mobrovac: I'm going to get your change out the door real quick. [15:57:00] RECOVERY - NTP on mw1262 is OK: NTP OK: Offset 0.008191466331 secs [15:57:07] grazie thcipriani! [15:57:18] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292937 (https://phabricator.wikimedia.org/T136205) (owner: 10Mobrovac) [15:57:21] RECOVERY - nutcracker port on mw1262 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [15:57:40] RECOVERY - salt-minion processes on mw1262 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:57:41] RECOVERY - nutcracker process on mw1262 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [15:58:01] RECOVERY - Check size of conntrack table on mw1262 is OK: OK: nf_conntrack is 0 % full [15:58:11] thcipriani: yeah I can confirm I cannot edit the file myself on tin [15:58:11] RECOVERY - configured eth on mw1262 is OK: OK - interfaces up [15:58:41] RECOVERY - Disk space on mw1262 is OK: DISK OK [15:58:42] RECOVERY - dhclient process on mw1262 is OK: PROCS OK: 0 processes with command name dhclient [15:58:51] RECOVERY - MD RAID on mw1262 is OK: OK: Active: 6, Working: 6, 
Failed: 0, Spare: 0 [16:00:00] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:292937|Math: Set wgMathFullRestbaseURL to point to wikimedia.org in production]] (duration: 00m 24s) [16:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:00:05] ^ mobrovac check please [16:00:11] kk [16:00:30] RECOVERY - DPKG on mw1262 is OK: All packages OK [16:00:56] yurik: any other right that might be needed? [16:01:53] thcipriani: is my patch going to make it today or should I re-schedule? [16:01:56] Nikerabbit: yeah, I'm not sure what the permissions are supposed to be on that file. If anyone in wikidev should be able to write it, I guess it should be 664 rather than 644 [16:02:01] thcipriani: works like a charm! thnx! [16:02:08] mobrovac: thanks for checking. [16:02:33] tgr, not that i see in the code - seems like the only right that i check is zero-script [16:03:29] Nikerabbit: yurik I think at this point we should reschedule the changes needed in privatesettings.php; we're already over time and I've got to run to a meeting. Sorry for the confusion here :( [16:03:48] (03PS1) 10Gergő Tisza: Create zeroscript grant group for zerowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292951 (https://phabricator.wikimedia.org/T135074) [16:03:52] thcipriani, agree [16:04:03] thcipriani: sure :( [16:04:24] thanks for understanding folks. [16:04:42] not your fault of course but I need to specifically work later to attend the swats... I'll try to get first spot some other day [16:05:20] yurik: https://gerrit.wikimedia.org/r/#/c/292951/ [16:06:11] (03CR) 10Hashar: [C: 031] "Thanks! Might want to add a second change to clear furud from sites.pp and have it back in the pool of available hardware."
[puppet] - 10https://gerrit.wikimedia.org/r/292940 (owner: 10Chad) [16:07:03] thcipriani: I can finish the SWAT if that's okay, the next release window is empty and I'd really like to get the AuthManager changes out today [16:07:28] tgr: absolutely, thank you :) [16:08:00] oh? [16:08:01] yurik: Nikerabbit: is that okay with you? [16:08:11] tgr, sure, reviewing your patch [16:09:22] tgr: would be great if it can happen during next 20 or so minutes! [16:11:08] 06Operations, 10Math, 10RESTBase, 06Services, 15User-mobrovac: parameter mathpurge=true should purge cache in restbase - https://phabricator.wikimedia.org/T136205#2357851 (10mobrovac) 05Open>03Resolved The Math extension will start sending `no-cache` requests to RESTBase on `mathpurge=true` when `1.2... [16:13:50] tgr, i don't think we need any of the new $wgGrantPermissions lines, only the $wgGrantPermissionGroups [16:15:24] yurik: it works like umask, you need the user right both via a group and a grant to be able to use it [16:15:28] RECOVERY - Outgoing network saturation on labstore1001 is OK: OK: Less than 10.00% above the threshold [93750000.0] [16:15:46] also, the grant and the group are unrelated, I just chose the same name [16:15:50] tgr, yes, but the same line is being set a few lines above [16:16:06] $wgGroupPermissions['zeroscript']['zero-script'] = true; [16:16:12] those are GroupPermissions, this is GrantPermissions [16:16:17] unless I messed up [16:16:30] ah [16:16:39] this creates a 'zeroscript' checkbox in the bot password grant list [16:17:03] if it's checked, these rights are not deleted from the permission list when using a bot password [16:17:14] sorry, misread. 
Lets only grant the zero-script right though, none of the other ones should be accessible this way [16:17:32] the -ips is not needed, and should be deleted [16:17:41] in the whole file [16:18:13] (03PS2) 10Gergő Tisza: Create zeroscript grant group for zerowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292951 (https://phabricator.wikimedia.org/T135074) [16:18:25] does that look OK? [16:19:06] (03PS3) 10Gergő Tisza: Create zeroscript grant group for zerowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292951 (https://phabricator.wikimedia.org/T135074) [16:19:10] (03CR) 10Yurik: [C: 031] Create zeroscript grant group for zerowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292951 (https://phabricator.wikimedia.org/T135074) (owner: 10Gergő Tisza) [16:19:38] PROBLEM - puppet last run on mw1262 is CRITICAL: CRITICAL: Puppet has 3 failures [16:19:46] yep [16:19:54] (03CR) 10Gergő Tisza: [C: 032] Create zeroscript grant group for zerowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292951 (https://phabricator.wikimedia.org/T135074) (owner: 10Gergő Tisza) [16:20:38] (03Merged) 10jenkins-bot: Create zeroscript grant group for zerowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292951 (https://phabricator.wikimedia.org/T135074) (owner: 10Gergő Tisza) [16:21:38] PROBLEM - Apache HTTP on mw1262 is CRITICAL: Connection refused [16:22:31] !log tgr@tin Synchronized wmf-config/CommonSettings.php: creating zeroscript grant group on zerowiki, gerrit: 292951 (duration: 00m 28s) [16:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:22:36] yurik: please check [16:23:02] tgr, actually bblack should add that right to his bot acct [16:23:39] bblack: should be something like in the grant list [16:23:42] <_joe_> !log rebooting mw1262 [16:24:06] (you can create a zerowiki MW: page for that message if you want a nicer name) [16:24:12] yurik: tgr: works [16:24:28] great, thanks! 
anything left to do about this? [16:24:33] don't think so [16:24:55] tgr, yes, now we need to change private settings for the banner account from other wikis [16:25:43] yurik: are you doing that or should I do it? [16:25:48] doing it [16:25:51] thx [16:26:14] bblack, what ips did you set for it? [16:26:25] because in theory i should do the same [16:28:26] 06Operations, 06Labs, 10Tool-Labs, 10netops: 'German Wikipedia Broken Weblinks Bot' is ill-behaved and in danger of getting all of Labs blacklisted - https://phabricator.wikimedia.org/T136829#2357927 (10Andrew) 05Open>03Resolved Update: the bitninja people seem to be wrong about everything. I'm closi... [16:29:48] RECOVERY - Apache HTTP on mw1262 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.095 second response time [16:29:57] 06Operations, 06Labs, 10Tool-Labs, 10netops: 'German Wikipedia Broken Weblinks Bot' is ill-behaved and in danger of getting all of Labs blacklisted - https://phabricator.wikimedia.org/T136829#2357942 (10MoritzMuehlenhoff) FTR, I received the same vague reply as Andrew, seems mostly auto-generated... [16:30:33] tgr, should i be changing a readonly file at /srv/mediawiki/private? [16:31:31] yurik: /srv/mediawiki/private, and sync it [16:31:36] no [16:31:51] sorry, mistyped that, meant /srv/mediawiki-staging/private [16:34:03] tgr, i modified the file, want to sync it? [16:34:14] i should probably add it to git [16:34:19] yes [16:34:26] btw, why is it readonly? [16:34:35] it isn't [16:34:40] which file is that? [16:34:43] 06Operations, 10ops-eqiad, 06Analytics-Kanban: Smartctl disk defects on kafka1012 - https://phabricator.wikimedia.org/T136933#2357954 (10Milimetric) [16:35:04] privatesettings.php [16:35:42] ok, checked it in. 
tgr go ahead and sync [16:35:45] hope it works ;) [16:35:46] yurik: ugh, apparently I took ownership of that file somehow [16:35:53] will fix later [16:37:08] !log tgr@tin Synchronized private/PrivateSettings.php: (no message) (duration: 00m 26s) [16:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:39:14] !log tgr@tin Synchronized wmf-config/PrivateSettings.php: (no message) (duration: 00m 27s) [16:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:39:20] yurik: please check [16:39:37] tgr, heh, checking is always fun with zero :) [16:39:43] looking [16:39:44] (03PS1) 10Andrew Bogott: Add 'libertine' fonts to tools exec nodes. [puppet] - 10https://gerrit.wikimedia.org/r/292954 (https://phabricator.wikimedia.org/T137121) [16:39:49] !log PrivateSettings changes were for T135074 [16:39:50] T135074: Update JsonConfig for AuthManager - https://phabricator.wikimedia.org/T135074 [16:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:40:56] tgr, something went boom in fatal mon [16:43:05] yurik: looks like no revisions are returned by the API [16:43:11] yep [16:43:16] not sure why [16:43:22] possibly another grants issue? [16:43:26] possible [16:43:28] checking... [16:43:29] does the password itself work? [16:43:37] sec [16:43:43] I can try touching the private redirect, that's needed sometimes [16:43:55] sure [16:44:08] tgr, do you have an easy way to query api? [16:44:20] i don't have any scripts handy [16:44:27] !log tgr@tin Synchronized wmf-config/PrivateSettings.php: (no message) (duration: 00m 23s) [16:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:44:34] yurik: ApiSandbox on zerowiki? [16:44:37] (03CR) 10Andrew Bogott: [C: 032] Add 'libertine' fonts to tools exec nodes. 
[puppet] - https://gerrit.wikimedia.org/r/292954 (https://phabricator.wikimedia.org/T137121) (owner: Andrew Bogott) [16:44:56] tgr, can i use bot password to login into wiki? [16:45:13] yurik, no, but you can use it to log in into the API [16:45:31] ie. log in normally, go to the special page, use login API with bot password [16:45:43] if you have both sets of cookies, bot gets preference [16:47:32] Nikerabbit: I don't think there will be time for TranslateNotifications, sorry :( [16:48:10] tgr: k [16:48:46] tgr, did you roll back the change? [16:48:56] i'm still testing [16:48:58] yurik: no [16:49:01] ok [16:49:14] just tried to mess with the access dates, that helps sometimes [16:51:33] (PS1) Andrew Bogott: Revert "Add 'libertine' fonts to tools exec nodes." [puppet] - https://gerrit.wikimedia.org/r/292957 [16:52:46] (CR) Paladox: [C: 1] Gerrit: kill replication to furud [puppet] - https://gerrit.wikimedia.org/r/292940 (owner: Chad) [16:54:01] (CR) Andrew Bogott: [C: 2] Revert "Add 'libertine' fonts to tools exec nodes."
[puppet] - 10https://gerrit.wikimedia.org/r/292957 (owner: 10Andrew Bogott) [16:55:09] tgr, actually i am not sure this is production - i just remembered that we have another script on the server that's using creds to get this data [16:55:27] actually two - one in analytics, and one ours [16:55:37] but it shouldn't generate this many hits [16:56:02] tgr, just to be more consistent, lets roll back the private settings change [16:56:11] and i will work on figuring it out [16:56:14] yurik: ack, will do [16:57:37] (03PS3) 10Dzahn: Gerrit: kill replication to furud [puppet] - 10https://gerrit.wikimedia.org/r/292940 (owner: 10Chad) [16:57:51] !log tgr@tin Synchronized private/PrivateSettings.php: (no message) (duration: 00m 23s) [16:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:58:21] !log tgr@tin Synchronized wmf-config/PrivateSettings.php: (no message) (duration: 00m 23s) [17:00:04] gehel: Dear anthropoid, the time has come. Please deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160606T1700). [17:00:40] yurik: fatalmonitor is still flooded [17:01:37] tgr, could it be bblack's script? [17:02:03] bblack: any thoughts? seems to run every minute exactly [17:02:12] ^ SMalyshev: I'm starting the deployment on beta first... [17:03:28] (03CR) 10Dzahn: [C: 032] "saw that -2 vote from jenkins-bot? went away with just rebase . a bit random" [puppet] - 10https://gerrit.wikimedia.org/r/292940 (owner: 10Chad) [17:04:14] mutante: Thx for the merge, doesn't need a gerrit restart, I'll reload the replication plugin [17:04:16] (03PS1) 10Andrew Bogott: Add tex/latex fonts to tools exec nodes. 
[puppet] - 10https://gerrit.wikimedia.org/r/292958 (https://phabricator.wikimedia.org/T137121) [17:04:33] (03PS2) 10Andrew Bogott: Add tex/latex fonts to tools exec nodes [puppet] - 10https://gerrit.wikimedia.org/r/292958 (https://phabricator.wikimedia.org/T137121) [17:04:40] gehel: deploying any config changes? we are still trying to clean up some logspam that resulted from the SWAT [17:04:46] ostriches: cool! are we shutting furud down then? [17:04:54] Yeah that's the plan [17:05:02] :) ok [17:05:08] tgr: only wikidata query service, there should be no interactions with anything else [17:05:18] cool, thanks [17:05:23] (03PS3) 10Andrew Bogott: Add tex/latex fonts to tools exec nodes [puppet] - 10https://gerrit.wikimedia.org/r/292958 (https://phabricator.wikimedia.org/T137121) [17:06:49] 06Operations, 10Gerrit: Gerrit replication to furud.codfw.wmnet fails with: reject HostKey: furud.codfw.wmnet - https://phabricator.wikimedia.org/T136822#2358092 (10demon) 05Open>03Resolved a:03demon Replication to furud killed. [17:07:08] (03CR) 10Jcrespo: "Also seeking +1 from ori as I think he was involved on closing its access in the first place." [puppet] - 10https://gerrit.wikimedia.org/r/292405 (owner: 10Jcrespo) [17:07:48] yurik: should I just write a fix for the undefined index thing or is having no revisions a bad thing in itself? 
[17:08:16] tgr, it is - it means there are no zero configs, which should never be the case [17:08:27] unless you just created the wiki [17:08:49] and since this is a custom one of a kind wiki, it makes no sense, so its good that it warned us [17:09:15] i am surprised that "error" was not triggered in code right above that [17:09:28] i wonder what result it receives [17:09:56] but i'm pretty sure its bblack's script, so must be some other weird dependency in there [17:09:57] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: puppet fail [17:10:05] (03PS12) 10Elukey: Extend the %{format}t timestamp formatter with (begin|end): prefixes [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/292172 (https://phabricator.wikimedia.org/T136314) [17:11:15] yurik: if it's that fast, whatever it is is almost certainly from zerofetcher, since it runs on ~100 hosts 4 times an hour. they'd all be on the minute-boundary, but usually more than one hit per minute. [17:11:46] SMalyshev: GUI updated on https://wdqs-test.wmflabs.org/, looks good to me, let me know if you want to test anything [17:12:18] (03CR) 10Andrew Bogott: [C: 032] Add tex/latex fonts to tools exec nodes [puppet] - 10https://gerrit.wikimedia.org/r/292958 (https://phabricator.wikimedia.org/T137121) (owner: 10Andrew Bogott) [17:12:29] (03PS1) 10BBlack: fix piwiki and grafana-admin HTTP checks for 401 [puppet] - 10https://gerrit.wikimedia.org/r/292959 [17:12:38] (03CR) 10Elukey: "I implemented all Brandon's suggestion plus another one that seemed related, namely the removal of the 'begin:' prefix since it was not us" [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/292172 (https://phabricator.wikimedia.org/T136314) (owner: 10Elukey) [17:12:43] (03PS2) 10BBlack: fix piwiki and grafana-admin HTTP checks for 401 [puppet] - 10https://gerrit.wikimedia.org/r/292959 [17:12:47] (03CR) 10Jcrespo: [C: 031] Add ores-admins group and provide permissions for scb [puppet] - 
10https://gerrit.wikimedia.org/r/291716 (https://phabricator.wikimedia.org/T136406) (owner: 10Ladsgroup) [17:12:59] gehel: looks ok to me [17:14:29] !log deploying latest GUI for wikidata query service [17:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:15:00] yurik: definitely bblack's script, the timing matches [17:15:11] started at 16:23 [17:16:11] logstash claims 600 warnings every minute, not sure if that can be trusted [17:16:23] bblack: i think with check_http the -e is just a number [17:16:31] SMalyshev: deployment completed, test queries passing, https://query.wikidata.org looks good to me [17:16:41] gehel: excellent, thanks! [17:17:11] SMalyshev: Number of WDQS deployment without complete failure in a row: 1... [17:17:21] now let's try to keep that number going up! [17:17:26] :P [17:17:32] we'll see ;) [17:17:56] mutante: [17:17:57] -e, --expect=STRING [17:17:57] Comma-delimited list of strings, at least one of them is expected in [17:18:00] the first (status) line of the server response (default: HTTP/1.) [17:18:03] If specified skips all other status line logic (ex: 3xx, 4xx, 5xx processing) [17:18:29] anomie: do you need any extra permissions to fetch content from a private wiki via MWGrants? [17:18:45] it looks like the requests are not rejected but there are no revisions [17:19:20] SMalyshev: the "Biological databases listed in Wikidata and if available applicable licenses" example query seems to have a typo. My SPARQL is bad enough that I'm not even sure I can spot it... [17:19:26] tgr: You shouldn't need any extra permissions, and if you're lacking 'read' it should error out. [17:19:32] I wonder how many tools are going to break as HTTP/2 goes live on various internal services over time heh [17:19:43] bblack: oh! well, nevermind then :) [17:19:44] the status lines are e.g. "HTTP/2.0 200" [17:20:13] gehel: why you think it has a typo? [17:20:15] SMalyshev: Oh, those example queries are stored on wiki? 
Kool! [17:20:21] tgr: Unless you're using an extension's content-fetching mechanism that decides to default to no-results instead of erroring on lack of permission. [17:20:21] probably /^HTTP\/1\.[01]/ is hardcoded in various places like check_http (but on the upside, check_http itself would need http/2 upgrades first before being affected) [17:20:25] gehel: yes they are [17:20:29] piwik/piwiki it's almost impossible to type a word that ends in wik and not add the i [17:20:39] 06Operations, 10hardware-requests: eqiad: spare allocation to replace labmon1001 - https://phabricator.wikimedia.org/T136970#2358148 (10mark) Approved. [17:21:08] SMalyshev: because when I try to run it, it gives me "Encountered "" at line 10, column 31." [17:21:32] gehel: ah, I see it. Missing } [17:21:33] SMalyshev: s/typo/syntax error/ [17:21:33] anomie: it's using zeroportal API https://github.com/wikimedia/mediawiki-extensions-ZeroPortal/blob/master/includes/ApiZeroPortal.php which doesn't seem to do anything special [17:21:36] tgr: if this is still about the zero fetcher, it did actually get data on my test fetch earlier [17:21:40] (03CR) 10Dzahn: [C: 031] fix piwiki and grafana-admin HTTP checks for 401 [puppet] - 10https://gerrit.wikimedia.org/r/292959 (owner: 10BBlack) [17:21:55] it seemed to do so rather slowly, though, maybe the fatal along the way slows it down [17:22:10] bblack: the warnings started the exact minute you tested it, so they are definitely triggered by the script [17:22:19] SMalyshev: right! I need to dig a bit into SPARQL... [17:22:34] gehel: I have a book but it's in the office :) [17:22:39] tgr: yeah I'm just saying, the perms "work", it does actually get the JSON content it asked for [17:23:51] SMalyshev: books are soo XXth century... [17:24:36] gehel: I know, I'm old-fashioned :) [17:24:52] bblack: can you revert the password for now? 
even if it works, it spams the logs with warnings [17:24:59] I'll do some manual testing later [17:25:17] (03CR) 1020after4: [C: 031] "failing because of lines longer than 80 characters." [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: 10GWicke) [17:25:21] tgr: ok [17:30:40] (03PS1) 10Yuvipanda: tools: Install jdk8 in trusty nodes [puppet] - 10https://gerrit.wikimedia.org/r/292960 (https://phabricator.wikimedia.org/T121279) [17:31:48] bd808: ^ wanna +1? [17:32:43] (03CR) 10BryanDavis: [C: 031] tools: Install jdk8 in trusty nodes [puppet] - 10https://gerrit.wikimedia.org/r/292960 (https://phabricator.wikimedia.org/T121279) (owner: 10Yuvipanda) [17:34:23] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [17:40:37] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to restricted and analytics-privatedata-users for Joe Sutherland (foks) - https://phabricator.wikimedia.org/T136137#2358193 (10elukey) @Jalexander, really sorry for the delay in the answer, I completely missed the phab assigned to me... [17:44:49] (03CR) 10BBlack: Extend the %{format}t timestamp formatter with (begin|end): prefixes (032 comments) [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/292172 (https://phabricator.wikimedia.org/T136314) (owner: 10Elukey) [17:47:31] (03PS1) 10Bmansurov: huwiki: Enable A/B test for 50% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292964 (https://phabricator.wikimedia.org/T136713) [17:47:53] yurik: what is wgZeroPortalImpersonateUser for? [17:48:11] tgr, sorry, meeting [17:48:20] no rush [17:54:59] bblack: is the source code of your script in gerrit? [17:57:02] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003, stat1002 and bast1001 for joewalsh - https://phabricator.wikimedia.org/T137110#2358241 (10dr0ptp4kt) Thanks, @Krenair and @Ottomata. 
@Krenair, the logs and tables do contain data specifically about the apps. [18:01:23] tgr: https://github.com/wikimedia/operations-puppet/blob/production/modules/varnish/files/zerofetch.py [18:06:28] (CR) BBlack: [C: 2] fix piwiki and grafana-admin HTTP checks for 401 [puppet] - https://gerrit.wikimedia.org/r/292959 (owner: BBlack) [18:08:03] !log Running rebuildrecentchanges.php for test2wiki for T133225 [18:08:04] T133225: test2wiki has no recent changes before the 20 april - https://phabricator.wikimedia.org/T133225 [18:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:12:05] (PS1) BBlack: zerofetcher: use secret() directly [puppet] - https://gerrit.wikimedia.org/r/292969 [18:13:10] (CR) Dzahn: "furud isnt hardware, was VM, but decom yea. so.. but what about antimony then?" [puppet] - https://gerrit.wikimedia.org/r/292940 (owner: Chad) [18:13:50] Operations, Performance-Team, codfw-rollout: test2wiki has no recent changes before the 20 april - https://phabricator.wikimedia.org/T133225#2358322 (ori) Open>Resolved a:ori I repopulated recent changes on test2wiki with all changes made on or after 2015-01-01.
[18:16:37] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 10.853 second response time [18:16:58] (03CR) 10Alexandros Kosiaris: [C: 032] ores: deprecate flower [puppet] - 10https://gerrit.wikimedia.org/r/292769 (https://phabricator.wikimedia.org/T137003) (owner: 10Ladsgroup) [18:17:00] (03PS3) 10Alexandros Kosiaris: ores: deprecate flower [puppet] - 10https://gerrit.wikimedia.org/r/292769 (https://phabricator.wikimedia.org/T137003) (owner: 10Ladsgroup) [18:18:09] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 15.049 second response time [18:18:54] (03PS1) 10Dzahn: remove furud from site.pp,dhcp,installserver [puppet] - 10https://gerrit.wikimedia.org/r/292971 [18:19:09] (03CR) 10BBlack: [C: 032] zerofetcher: use secret() directly [puppet] - 10https://gerrit.wikimedia.org/r/292969 (owner: 10BBlack) [18:19:51] (03PS2) 10Dzahn: remove furud from site.pp,dhcp,installserver [puppet] - 10https://gerrit.wikimedia.org/r/292971 [18:20:15] (03PS4) 10Alexandros Kosiaris: ores: deprecate flower [puppet] - 10https://gerrit.wikimedia.org/r/292769 (https://phabricator.wikimedia.org/T137003) (owner: 10Ladsgroup) [18:20:21] (03CR) 10Alexandros Kosiaris: [V: 032] ores: deprecate flower [puppet] - 10https://gerrit.wikimedia.org/r/292769 (https://phabricator.wikimedia.org/T137003) (owner: 10Ladsgroup) [18:20:46] (03PS3) 10Dzahn: remove furud from site.pp,dhcp,installserver [puppet] - 10https://gerrit.wikimedia.org/r/292971 [18:21:11] (03PS7) 10Alexandros Kosiaris: Add ores-admins group and provide permissions for scb [puppet] - 10https://gerrit.wikimedia.org/r/291716 (https://phabricator.wikimedia.org/T136406) (owner: 10Ladsgroup) [18:21:19] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Add ores-admins group and provide permissions for scb [puppet] - 10https://gerrit.wikimedia.org/r/291716 
(https://phabricator.wikimedia.org/T136406) (owner: Ladsgroup) [18:21:21] (PS4) Dzahn: remove furud from site.pp,dhcp,installserver [puppet] - https://gerrit.wikimedia.org/r/292971 (https://phabricator.wikimedia.org/T123718) [18:21:53] (PS3) Alexandros Kosiaris: Introduce ores.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/292542 (https://phabricator.wikimedia.org/T124203) [18:21:58] (CR) Alexandros Kosiaris: [C: 2 V: 2] Introduce ores.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/292542 (https://phabricator.wikimedia.org/T124203) (owner: Alexandros Kosiaris) [18:22:15] (PS13) Elukey: Extend the %{format}t timestamp formatter with (begin|end): prefixes [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/292172 (https://phabricator.wikimedia.org/T136314) [18:22:57] (PS14) Elukey: Extend the %{format}t timestamp formatter with the 'end:' prefix [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/292172 (https://phabricator.wikimedia.org/T136314) [18:23:55] (CR) Alexandros Kosiaris: [C: 2] ores: Add varnish backend in the misc cluster [puppet] - https://gerrit.wikimedia.org/r/292543 (https://phabricator.wikimedia.org/T124203) (owner: Alexandros Kosiaris) [18:24:00] (PS2) Alexandros Kosiaris: ores: Add varnish backend in the misc cluster [puppet] - https://gerrit.wikimedia.org/r/292543 (https://phabricator.wikimedia.org/T124203) [18:24:04] (CR) Alexandros Kosiaris: [V: 2] ores: Add varnish backend in the misc cluster [puppet] - https://gerrit.wikimedia.org/r/292543 (https://phabricator.wikimedia.org/T124203) (owner: Alexandros Kosiaris) [18:24:12] bblack: thanks for the +1 on ^ [18:24:50] (PS3) Dzahn: add moon.wikimedia.org, point to cluster [dns] - https://gerrit.wikimedia.org/r/292771 (https://phabricator.wikimedia.org/T136557) [18:25:02] (PS4) Dzahn: add moon.wikimedia.org, point to cluster [dns] - https://gerrit.wikimedia.org/r/292771
(https://phabricator.wikimedia.org/T136557) [18:25:13] 06Operations, 10hardware-requests: Replace/refresh carbon - https://phabricator.wikimedia.org/T137117#2357766 (10RobH) We do not have any spare systems with 3TB * 6, only 4TB * 4 and it would max out the disk bays of those particular systems: Example system WMF4723. Dual Intel® Xeon® Processor E5- 2623 V3 w... [18:29:37] (03CR) 10BBlack: [C: 031] "Looks good on human review to me (conceptual and basic C-code structural stuff). Note my +1 doesn't include actual compilation or testing" [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/292172 (https://phabricator.wikimedia.org/T136314) (owner: 10Elukey) [18:30:26] (03CR) 10Dzahn: [C: 032] "https://meta.wikimedia.org/wiki/Wikipedia_to_the_Moon" [dns] - 10https://gerrit.wikimedia.org/r/292771 (https://phabricator.wikimedia.org/T136557) (owner: 10Dzahn) [18:31:53] 06Operations, 10DNS, 10Traffic, 10Wikimedia-Apache-configuration, 13Patch-For-Review: Create moon.wikimedia.org and redirect it to https://meta.wikimedia.org/wiki/Wikipedia_to_the_Moon - https://phabricator.wikimedia.org/T136557#2338844 (10Dzahn) added to DNS: moon.wikimedia.org has address 198.35.26.96... 
[18:33:28] (03PS2) 10Dzahn: redirect moon.wikimedia.org to meta page [puppet] - 10https://gerrit.wikimedia.org/r/292772 (https://phabricator.wikimedia.org/T136557) [18:33:34] (03PS1) 10BBlack: fix grafana(-admin) icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/292975 [18:33:48] (03PS3) 10Dzahn: redirect moon.wikimedia.org to meta page [puppet] - 10https://gerrit.wikimedia.org/r/292772 (https://phabricator.wikimedia.org/T136557) [18:33:56] (03CR) 10BBlack: [C: 032 V: 032] fix grafana(-admin) icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/292975 (owner: 10BBlack) [18:37:32] 06Operations, 10hardware-requests: Replace/refresh carbon - https://phabricator.wikimedia.org/T137117#2358514 (10faidon) 4*4TB makes for 8TB usable, so it could work, although it's kind of suboptimal (just 4 spindles); 32GB would probably work fine for now (carbon has just 8GB!), and 10GbE… can probably wait u... [18:40:16] 06Operations, 06Performance-Team, 07Availability: Audit mysql database class and hhvm binding support of SSL - https://phabricator.wikimedia.org/T136218#2358552 (10aaron) Add any ssl flags needed to DB classes and test with mysql/mariadb in vagrant. 
[18:40:32] 06Operations, 06Performance-Team, 07Availability: Audit mysql database class and hhvm binding support of SSL - https://phabricator.wikimedia.org/T136218#2358554 (10aaron) [18:44:06] (03PS7) 10BBlack: ssl_ciphersuite: standardize STS preload [puppet] - 10https://gerrit.wikimedia.org/r/292929 (https://phabricator.wikimedia.org/T132685) [18:44:31] (03CR) 10BBlack: [C: 032 V: 032] "puppet-compiler checked out ok on a number of the affected hosts" [puppet] - 10https://gerrit.wikimedia.org/r/292929 (https://phabricator.wikimedia.org/T132685) (owner: 10BBlack) [18:44:41] (03PS7) 10BBlack: Set includeSub/preload for wikimedia.org in VCL [puppet] - 10https://gerrit.wikimedia.org/r/292930 (https://phabricator.wikimedia.org/T132685) [18:45:24] (03CR) 10BBlack: [C: 032 V: 032] Set includeSub/preload for wikimedia.org in VCL [puppet] - 10https://gerrit.wikimedia.org/r/292930 (https://phabricator.wikimedia.org/T132685) (owner: 10BBlack) [18:49:05] 06Operations, 10MediaWiki-ResourceLoader, 06Performance-Team, 10Traffic: Image urls in CSS remain cached with old $wgResourceBasePath - https://phabricator.wikimedia.org/T134368#2358567 (10Krinkle) a:03Krinkle [18:51:18] jouncebot: next [18:51:18] In 1 hour(s) and 8 minute(s): Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160606T2000) [18:53:24] (03PS4) 10Dzahn: redirect moon.wikimedia.org to meta page [puppet] - 10https://gerrit.wikimedia.org/r/292772 (https://phabricator.wikimedia.org/T136557) [18:54:33] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [18:55:47] seems to be a very bref/small spike of 500 [18:55:52] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [18:55:54] *brief [18:56:12] PROBLEM - puppet last run on uranium is CRITICAL: CRITICAL: puppet fail [18:57:59] (03CR) 10Hashar: "I guess the 
idea was to migrate gitblit from antinomy (Ubuntu) to furud (Jessie). We are keeping gitblit around until Differential is 100" [puppet] - 10https://gerrit.wikimedia.org/r/292940 (owner: 10Chad) [18:58:07] bblack: could you maybe make a copy of that zerowiki bot password for testing? or just share the password and change it afterwards? [18:58:28] tgr: yes [18:59:59] (03CR) 10Paladox: "I thought we can now redirect git.wikimedia.org to diffusion since we now import all the refs in diffusion." [puppet] - 10https://gerrit.wikimedia.org/r/292940 (owner: 10Chad) [19:00:23] RECOVERY - puppet last run on uranium is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [19:00:23] (03CR) 10Dzahn: "furud was already created to be a stop gap so that antimony can be shutdown which is our actual priority here because that is precise" [puppet] - 10https://gerrit.wikimedia.org/r/292940 (owner: 10Chad) [19:00:53] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:01:45] (03CR) 10Dzahn: "so while "We are keeping gitblit around" is an option it should not be "we are keeping antimony around" please" [puppet] - 10https://gerrit.wikimedia.org/r/292940 (owner: 10Chad) [19:02:13] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:03:54] trying to log in to mediawiki.org: Exception encountered, of type "Exception" [19:04:02] very helpful error message [19:04:57] hmm, just logged out and back in, worked normal [19:05:10] twentyafterfour: https://en.wikipedia.org/wiki/Help:Logging_in#Login_issues_and_problems [19:05:20] ran into that the other day, evidently a known thing [19:06:08] (03CR) 10Chad: "Keeping it until we can write up some apache redirects, not until we're set with Differential. We only need Diffusion, which we have." 
[puppet] - 10https://gerrit.wikimedia.org/r/292940 (owner: 10Chad) [19:06:23] PROBLEM - puppet last run on carbon is CRITICAL: CRITICAL: puppet fail [19:06:23] PROBLEM - PyBal backends health check on lvs1005 is CRITICAL: PYBAL CRITICAL - streamlb_443 - Could not depool server rcs1002.eqiad.wmnet because of too many down!: streamlb6_80 - Could not depool server rcs1002.eqiad.wmnet because of too many down!: streamlb_80 - Could not depool server rcs1002.eqiad.wmnet because of too many down!: streamlb6_443 - Could not depool server rcs1002.eqiad.wmnet because of too many down! [19:06:33] PROBLEM - PyBal backends health check on lvs1011 is CRITICAL: PYBAL CRITICAL - streamlb_443 - Could not depool server rcs1002.eqiad.wmnet because of too many down!: streamlb6_443 - Could not depool server rcs1002.eqiad.wmnet because of too many down!: streamlb_80 - Could not depool server rcs1002.eqiad.wmnet because of too many down!: streamlb6_80 - Could not depool server rcs1002.eqiad.wmnet because of too many down! [19:06:33] PROBLEM - PyBal backends health check on lvs1002 is CRITICAL: PYBAL CRITICAL - streamlb_443 - Could not depool server rcs1002.eqiad.wmnet because of too many down!: streamlb6_443 - Could not depool server rcs1002.eqiad.wmnet because of too many down!: streamlb_80 - Could not depool server rcs1002.eqiad.wmnet because of too many down!: streamlb6_80 - Could not depool server rcs1002.eqiad.wmnet because of too many down! [19:06:39] !log aaron@tin Synchronized php-1.28.0-wmf.4/includes/page/WikiPage.php: 661c22db3a352 (duration: 00m 30s) [19:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:08:08] is the stream thing from some known cause? 
[19:08:23] RECOVERY - puppet last run on carbon is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [19:08:35] !log ran puppet on carbon because icinga said fail, saw it change STS headers, but no fail [19:08:36] some of the puppetfails are temporary race-conditions on the HSTS fix [19:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:08:44] they fix themselves on next run, yeah [19:08:51] *nod* alright [19:09:01] i dunno about rcstrea, i think no [19:09:34] oh, here [19:09:38] SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [19:09:41] on both nodes [19:09:54] yeah I'm looking at it [19:10:02] but that says 9 days ago.. what .reallly [19:10:10] it's related to the STS rollout, apparently the config change triggers an nginx reload [19:10:21] there is a ticket link there too, not new [19:10:32] but nginx gets stuck reloading, probably because it waits for persistent connections to drain and the websocket ones never do? [19:10:40] https://phabricator.wikimedia.org/T134361 [19:11:11] eh, that is linked in icinga, but different error type for expiration [19:12:04] !log restarted nginx on rcs1002 (was stuck half-shut-down for reload?), started nginx on rcs1001 (wasn't running at all) [19:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:13:09] 06Operations, 06Discovery, 06Labs, 10hardware-requests: rack/upgrade/setup/install/deploy relforge100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T136708#2358627 (10RobH) a:05RobH>03Cmjohnson assigned to @Cmjohnson for the disk installations/onsite part. Once done feel free to assign back to m... 
[19:14:32] (03CR) 10Dzahn: [C: 032] "tested on mw1017 / apache-fast-test" [puppet] - 10https://gerrit.wikimedia.org/r/292772 (https://phabricator.wikimedia.org/T136557) (owner: 10Dzahn) [19:14:40] (03PS1) 10Andrew Bogott: Ensure mariadb service is running on wikitech host. [puppet] - 10https://gerrit.wikimedia.org/r/292980 (https://phabricator.wikimedia.org/T125987) [19:14:42] (03PS1) 10Andrew Bogott: Ensure mariadb running on labservices hosts. [puppet] - 10https://gerrit.wikimedia.org/r/292981 [19:15:52] !log restarting kafka broker on kafka1020 to test python consumption client [19:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:16:34] PROBLEM - puppet last run on mw1171 is CRITICAL: CRITICAL: Puppet has 1 failures [19:16:42] RECOVERY - PyBal backends health check on lvs1005 is OK: PYBAL OK - All pools are healthy [19:16:44] RECOVERY - PyBal backends health check on lvs1011 is OK: PYBAL OK - All pools are healthy [19:16:52] RECOVERY - PyBal backends health check on lvs1002 is OK: PYBAL OK - All pools are healthy [19:18:52] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 11.198 second response time [19:20:52] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 9.072 second response time [19:24:23] PROBLEM - Outgoing network saturation on labstore1001 is CRITICAL: CRITICAL: 13.79% of data above the critical threshold [106250000.0] [19:28:09] 06Operations, 10ops-codfw: rack/setup/deploy new codfw mw app servers - https://phabricator.wikimedia.org/T135466#2358666 (10Papaul) I chat with Joe on IRC he said to install Jessie on all the new mw app servers [19:32:51] (03CR) 10Ottomata: [C: 031] ":D" [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/292172 (https://phabricator.wikimedia.org/T136314) (owner: 10Elukey) [19:35:16] 
06Operations, 10Fundraising-Backlog: Allow Fundraising to A/B test wikipedia.org as send domain - https://phabricator.wikimedia.org/T135410#2358701 (10CCogdill_WMF) p:05Normal>03High Changing priority so can roadmap this. Please let me know your thoughts, thanks! [19:41:23] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures [19:43:43] RECOVERY - puppet last run on mw1171 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:44:14] bblack: I think the warning issue tgr has been working on can be fixed (for now) by including the "High-volume editing" grant on the bot password. It looks like a bug that only occurs when there are too many Zero configuration pages on the wiki, and the apihighlimits right increases the critical number enough to not trigger it currently. [19:45:07] (03PS1) 10BBlack: clamav package update broken config workaround [puppet] - 10https://gerrit.wikimedia.org/r/292986 [19:45:22] (03CR) 10BBlack: [C: 032 V: 032] clamav package update broken config workaround [puppet] - 10https://gerrit.wikimedia.org/r/292986 (owner: 10BBlack) [19:46:27] anomie: do you mean too many past revs of the JSON file? note the bot account doesn't do any editing, only readonly fetching [19:46:57] bblack: No, too many pages in the "Zero" namespace. [19:47:27] oh ok [19:48:03] bblack: We haven't seen the bug until now because there are 5 proxy pages and 109 carrier pages. The critical number is 500 with apihighlimits, but only 50 without it. [19:49:14] anomie: ok I have the bot password high-volume editing [19:49:17] tgr: ^ [19:50:05] thanks! bblack, do you want to try to re-enable the password? [19:50:22] yeah I can try. it takes a little while to puppetize out to the all nodes [19:50:33] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 2 failures [19:50:53] rolling out now... 
[19:51:19] bblack, tgr: I confirmed that manually running the query that was triggering the warnings before no longer triggers it. [19:52:55] (03CR) 10Thcipriani: "This is really cool, very useful for deployment." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: 10GWicke) [19:53:43] RECOVERY - Outgoing network saturation on labstore1001 is OK: OK: Less than 10.00% above the threshold [93750000.0] [19:53:45] bblack: is it a bad time to run "varnishadm ban" commands via salt [19:54:12] depends on the command [19:54:13] :) [19:54:29] I mean it's not a particularly bad time, but some ban commands are bad at all times :) [19:55:05] eh, yea , i'm following the docs for one-off purges [19:55:29] and did the same a little while ago [19:55:30] tgr, bblack: FYI, T137144 is an explanation of the bug here. [19:55:31] T137144: ApiZeroPortal spams logs with warnings if there are sufficient Zero configurations - https://phabricator.wikimedia.org/T137144 [19:56:04] and in 3 steps, first eqiad, then not eqiad then backends .. salt -v -t 30 -C 'G@cluster:cache_text and G@site:eqiad' cmd.run 'varnishadm ban req.http.host == "moon.wikimedia.org" [19:56:26] mutante: why? [19:56:33] i just wanted to make sure "rolling out now" wasnt conflicting [19:56:52] bblack: because it cached the default server page and i want the redirect to work now [19:57:06] shouldn't the default have been some kind of 404? [19:57:27] bblack: it was in DNS before it was in apache and i opened the URL and there is a default vhost [19:57:31] guess not [19:57:56] IMHO, that's a whole separate ticket. 
random unknown hostnames should 404, not show the wikimedia default page thingyt [19:58:15] the Apache config has a catch-all at the end [19:58:18] (if it had been a 404, it would've self-fixed in 5 mins or less) [19:58:19] yea [19:58:31] the last time i purged something it was for the same reason [19:58:58] anyways, it's not a bad time to ban [19:59:03] alright [19:59:10] but also note the ban docs are out of date, things are somewhat more-dynamic now [19:59:26] but the easy thing to do is eqiad, then codfw, then ulsfo+esams [19:59:30] for the backend purge ordering [19:59:32] when i ran them not that long ago they did work.. but salt was weird [19:59:43] and talked about hosts that are not in cluster:cache_text [19:59:54] like if it ran on * [20:00:04] gwicke cscott arlolra subbu bearND mdholloway: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160606T2000). [20:00:05] salt always does that if it has no -b argument for batching [20:00:08] try " [20:00:15] try adding "-b 1000" to shut it up [20:00:19] nothing to deploy for mobileapps today [20:01:08] bblack: ok!, doing eqiad, codfw, then the rest [20:02:23] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:02:43] tgr: they're all using the new creds now, no log spam? [20:02:43] with -b 1000 , it took a while but none of all the unrelated hosts. ack. thanks [20:03:14] bblack, anomie just figured out what's causing it - could you login and add bot account rights? [20:03:18] bblack: log is clear [20:03:43] yurik: Already done. [20:03:45] tgr: ok, regenerating a new password and pushing that around too, could cause a small storm of login failures but will clear itself shortly [20:03:52] anomie, awesome, which account? [20:04:01] just the one that bblack is using? [20:04:07] Presumably. 
[20:04:24] i suspect that I will need to do it for a few more accounts, as well as fix it in the code [20:04:26] yurik: Netmapper [20:04:31] thx! [20:09:58] !log starting Parsoid deploy [20:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:10:45] bblack: done and https://wikitech.wikimedia.org/w/index.php?title=Varnish&type=revision&diff=614734&oldid=604211 [20:11:51] mutante: thanks, definite improvement for now :) [20:12:04] mutante: in the long run, there won't really be any consistent directions about doing that, though :/ [20:12:24] 06Operations, 10DNS, 10Traffic, 10Wikimedia-Apache-configuration, 13Patch-For-Review: Create moon.wikimedia.org and redirect it to https://meta.wikimedia.org/wiki/Wikipedia_to_the_Moon - https://phabricator.wikimedia.org/T136557#2358803 (10Dzahn) after having to purge the cached default page from varnish... [20:12:30] mutante: (because eventually the routing between remote DCs + codfw + eqiad will vary both dynamically and by-appserver (e.g. MW vs RB)) [20:14:06] bblack: ooh! ok. well, i guess i will make a ticket for the Apache default page behaviour, so that we get 404s in the future [20:14:35] thanks! 
[20:14:58] although I wonder if there's existing hostnames that get caught in the crossfire [20:15:21] (that are set up in DNS from long ago, expected to hit that landing page by-default because they're not configured in apache/mw) [20:15:52] yea, we need to check, i expect we will find a few [20:16:03] we could do some kind of hybrid solution too I guess: still show that page, but as the document text of a status==404 response :) [20:16:04] but i also dont think people will care about them if they just showed default all this time [20:16:18] oh, yea, i thought that actually [20:16:38] still show that page but send 404 [20:16:40] (03PS3) 10Gergő Tisza: Apply AbuseFilter configuration syntax change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292758 [20:16:49] but that brings us back to that epic ticket [20:17:01] about all the error pages [20:17:04] (03CR) 10Gergő Tisza: "Good point, thanks." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292758 (owner: 10Gergő Tisza) [20:17:22] (03CR) 10jenkins-bot: [V: 04-1] Apply AbuseFilter configuration syntax change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292758 (owner: 10Gergő Tisza) [20:17:33] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:19:27] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack and setup payments1005-8 - https://phabricator.wikimedia.org/T136881#2358810 (10Cmjohnson) [20:20:45] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack and setup payments1005-8 - https://phabricator.wikimedia.org/T136881#2358818 (10Jgreen) [20:21:11] (03CR) 10Anomie: Apply AbuseFilter configuration syntax change (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292758 (owner: 10Gergő Tisza) [20:21:53] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [20:21:57] (03PS4) 10Gergő Tisza: Apply AbuseFilter configuration syntax 
change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292758 [20:23:13] (03PS5) 10Gergő Tisza: Apply AbuseFilter configuration syntax change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292758 [20:25:10] !log updated Parsoid to version e8d6092e [20:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:25:33] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [20:25:43] (03PS1) 10Jdrewniak: T131526 A/B test on wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292992 (https://phabricator.wikimedia.org/T131526) [20:26:03] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [20:27:03] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 200, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/2/3: down - Core: cr2-codfw:xe-5/0/1 (Zayo, OGYX/120003//ZYO) 36ms {#2909} [10Gbps wave]BR [20:28:17] (03CR) 10Anomie: [C: 031] "Looks good to go. 
My suggestion would be to merge this first, see that Beta Labs still shows the right configurations, deploy it to prod b" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292758 (owner: 10Gergő Tisza) [20:29:40] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack and setup new fundraising queue servers - https://phabricator.wikimedia.org/T136882#2358885 (10Jgreen) [20:32:02] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:32:24] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:32:33] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:43:09] 06Operations, 10DNS, 10Traffic, 10Wikimedia-Apache-configuration, 13Patch-For-Review: Create moon.wikimedia.org and redirect it to https://meta.wikimedia.org/wiki/Wikipedia_to_the_Moon - https://phabricator.wikimedia.org/T136557#2359006 (10Dzahn) 05Open>03Resolved a:03Dzahn @MartinRulsch @Heather w... [20:44:23] 06Operations, 10DNS, 10Traffic, 10Wikimedia-Apache-configuration: Create moon.wikimedia.org and redirect it to https://meta.wikimedia.org/wiki/Wikipedia_to_the_Moon - https://phabricator.wikimedia.org/T136557#2359067 (10Dzahn) [20:54:45] 06Operations, 10Mail, 10OTRS: otrs email outage tracking task - https://phabricator.wikimedia.org/T137145#2358759 (10RobH) [20:55:47] 06Operations, 13Patch-For-Review: create endowment.wm.org microsite - https://phabricator.wikimedia.org/T136735#2359124 (10Dzahn) p:05Triage>03Low [20:56:51] 06Operations, 10ops-eqiad: decom magnesium (data center) - https://phabricator.wikimedia.org/T137006#2359130 (10Dzahn) a:05Dzahn>03None [20:58:16] 06Operations, 10Icinga, 10Monitoring: re-create script for manual paging - https://phabricator.wikimedia.org/T82937#2359136 (10Dzahn) tried that and doesnt work as expected. 
the custom message is not showing up, just shows OK [20:59:05] 06Operations, 07Blocked-on-RelEng, 05Gitblit-Deprecate, 13Patch-For-Review: Phase out antimony.wikimedia.org (git.wikimedia.org / gitblit) - https://phabricator.wikimedia.org/T123718#2359141 (10Dzahn) https://gerrit.wikimedia.org/r/#/c/292940/ gerrit-replication to furud has been removed, see comments on... [21:24:57] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to restricted and analytics-privatedata-users for Joe Sutherland (foks) - https://phabricator.wikimedia.org/T136137#2359234 (10Jalexander) Thanks guys, do we have sense on the time line for now? I've delayed more on-boarding a couple... [21:35:40] (03CR) 1020after4: [C: 031] "is this ready to go?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287741 (owner: 10Ottomata) [21:38:31] 06Operations, 06Research-and-Data-Backlog, 10Research-management, 06Revision-Scoring-As-A-Service, and 3 others: [Epic] Deploy Revscoring/ORES service in Prod - https://phabricator.wikimedia.org/T106867#2359253 (10akosiaris) [21:39:07] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting deployment access (for deploying to scb) for Ladsgroup - https://phabricator.wikimedia.org/T136406#2359254 (10akosiaris) 05Open>03Resolved Done by puppet patch above. Resolving [21:52:15] 06Operations, 13Patch-For-Review: create endowment.wm.org microsite - https://phabricator.wikimedia.org/T136735#2359306 (10Dzahn) [21:53:30] 06Operations, 13Patch-For-Review: create endowment.wm.org microsite - https://phabricator.wikimedia.org/T136735#2346094 (10Dzahn) 05Open>03Resolved mailed Steph, Mia, Marc and Heather. Apparently both Steph and Mia don't work for Mule anymore. This worries me a bit because they were the ones already famili... 
[21:57:47] (03PS1) 10MaxSem: Enable $wgGeoDataDebug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293037 [22:05:32] !log aaron@tin Synchronized php-1.28.0-wmf.4/includes/api/ApiStashEdit.php: 50ce579046e07 (duration: 00m 23s) [22:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:20:59] (03PS10) 10GWicke: Logstash_checker script for canary deploys [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) [22:25:03] 06Operations, 10Analytics, 10MediaWiki-extensions-CentralNotice, 10Traffic: Generate a list of junk CN cookies being sent by clients - https://phabricator.wikimedia.org/T132374#2359414 (10AndyRussG) [22:28:34] (03CR) 10Hashar: [C: 031] remove furud from site.pp,dhcp,installserver [puppet] - 10https://gerrit.wikimedia.org/r/292971 (https://phabricator.wikimedia.org/T123718) (owner: 10Dzahn) [22:32:51] (03PS11) 10GWicke: Logstash_checker script for canary deploys [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) [22:36:46] (03PS12) 10GWicke: Logstash_checker script for canary deploys [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) [22:37:40] (03CR) 10GWicke: Logstash_checker script for canary deploys (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: 10GWicke) [22:38:01] (03CR) 10jenkins-bot: [V: 04-1] Logstash_checker script for canary deploys [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: 10GWicke) [22:38:34] (03PS13) 10GWicke: Logstash_checker script for canary deploys [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) [22:42:29] 06Operations, 10Traffic, 10fundraising-tech-ops: Fix nits in Fundraising HTTPS/HSTS configs in wikimedia.org domain - https://phabricator.wikimedia.org/T137161#2359459 (10BBlack) 
[22:43:14] 06Operations, 10Traffic, 10fundraising-tech-ops: Fix nits in Fundraising HTTPS/HSTS configs in wikimedia.org domain - https://phabricator.wikimedia.org/T137161#2359472 (10BBlack) [22:47:42] 06Operations, 10Ops-Access-Requests: New SSH key for AWight - https://phabricator.wikimedia.org/T137162#2359476 (10awight) [22:47:43] (03PS2) 10Jdrewniak: T131526 A/B test on wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292992 (https://phabricator.wikimedia.org/T131526) [22:49:22] 06Operations, 10Ops-Access-Requests: New SSH key for AWight - https://phabricator.wikimedia.org/T137162#2359476 (10awight) /me wonders how extra subscribers were added when this had "access requests" security [22:50:08] 06Operations, 10Ops-Access-Requests: New SSH key for AWight - https://phabricator.wikimedia.org/T137162#2359491 (10awight) @AndyRussG also, sorry for CCing you here but thought it might be helpful if you have to go through the same steps [22:53:53] 06Operations, 10Ops-Access-Requests: New SSH key for AWight - https://phabricator.wikimedia.org/T137162#2359497 (10Dereckson) a:03faidon Assigning to Faidon for triaging (ops clinic duty) [22:54:06] 06Operations, 10Traffic, 10fundraising-tech-ops: Fix nits in Fundraising HTTPS/HSTS configs in wikimedia.org domain - https://phabricator.wikimedia.org/T137161#2359500 (10BBlack) [22:54:08] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2359499 (10BBlack) [22:55:16] 06Operations, 10Traffic, 07HTTPS: Preload HSTS - https://phabricator.wikimedia.org/T104244#2359505 (10BBlack) [22:55:18] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: Preload STS for wikimedia.org - https://phabricator.wikimedia.org/T132685#2359503 (10BBlack) 05Open>03Resolved This is submitted for preload now (which takes an agonizingly long and unpredictable time to reach the 
chrome list and then browsers...) [22:55:35] 06Operations, 10Traffic, 07HTTPS, 07Tracking: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#2359507 (10BBlack) [22:55:37] 06Operations, 10Traffic, 07HTTPS: Preload HSTS - https://phabricator.wikimedia.org/T104244#1411365 (10BBlack) 05Open>03Resolved [22:56:06] hey, if any one around. I can't login to tin.eqiad.wmnet [22:56:20] akosiaris: ^ (I'm not sure if you're around [22:56:31] Hi Amir1, I can log through bast3001.wikimedia.org [22:56:38] I am [22:56:47] awesome [22:56:58] Dereckson: I think I don't have access yet [22:57:16] akosiaris: should I do another access request? [22:57:25] I want to deploy the redis fix [22:57:54] 06Operations, 10Traffic, 07HTTPS, 07Tracking: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#2359537 (10BBlack) [22:58:10] I even fixed the submodule in gerrit https://gerrit.wikimedia.org/r/#/c/293049/ [22:58:14] Amir1: ah, you are missing the deployment group [22:58:30] lemme fix that [22:58:37] thanks :) [23:00:04] RoanKattouw ostriches Krenair MaxSem Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160606T2300). Please do the needful. [23:00:04] bblack jan_drewniak tgr: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[23:00:10] (03PS14) 10GWicke: Logstash_checker script for canary deploys [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) [23:00:14] I'll do it [23:00:28] MaxSem: ok [23:00:34] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting deployment access (for deploying to scb) for Ladsgroup - https://phabricator.wikimedia.org/T136406#2359543 (10akosiaris) 05Resolved>03Open Adding access to deployment group which was not done in above patch [23:00:53] (03PS3) 10MaxSem: symlink /.well-known/apple-app-site-association to /apple-app-site-association [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287190 (https://phabricator.wikimedia.org/T130647) (owner: 10Filippo Giunchedi) [23:01:01] <- I'm here, and mine's trivial [23:01:02] (03CR) 10MaxSem: [C: 032] symlink /.well-known/apple-app-site-association to /apple-app-site-association [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287190 (https://phabricator.wikimedia.org/T130647) (owner: 10Filippo Giunchedi) [23:01:05] * aude needs to eat soon, so would like my patch sooner [23:01:11] (03CR) 10jenkins-bot: [V: 04-1] Logstash_checker script for canary deploys [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: 10GWicke) [23:01:35] (03PS1) 10Alexandros Kosiaris: admin: Add ladsgroup to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/293051 (https://phabricator.wikimedia.org/T136406) [23:01:41] or else, i might do it myself later [23:01:47] (03Merged) 10jenkins-bot: symlink /.well-known/apple-app-site-association to /apple-app-site-association [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287190 (https://phabricator.wikimedia.org/T130647) (owner: 10Filippo Giunchedi) [23:01:52] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] admin: Add ladsgroup to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/293051 (https://phabricator.wikimedia.org/T136406) (owner: 10Alexandros Kosiaris) 
[23:02:14] aude, you mean GeoData? I can test those :) [23:02:24] i added a wikidata patch [23:02:33] ah, see now [23:02:40] will do after ^ then [23:02:43] ok [23:02:52] jenkins usually takes a while, so could +2 already [23:02:52] MaxSem: can you deploy 292758 to beta and prod separately? [23:03:09] who's doing a swat today? [23:03:15] I [23:03:40] tgr, only for a short period of time [23:03:49] kulio, MaxSem, can you sync the private settings (needs a password update) [23:04:06] we couldn't finish it in the morning swat [23:04:27] MaxSem: sure, I just need a few minutes to check it's working [23:04:47] !log maxsem@tin Synchronized docroot/wikipedia.org/.well-known/apple-app-site-association: https://gerrit.wikimedia.org/r/#q,287190,n,z (duration: 00m 25s) [23:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:04:54] bblack, ^ [23:05:31] (03PS1) 10Alexandros Kosiaris: admin: Fix typo introduced in If2e2be807eea90b7002 [puppet] - 10https://gerrit.wikimedia.org/r/293052 (https://phabricator.wikimedia.org/T136406) [23:05:42] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] admin: Fix typo introduced in If2e2be807eea90b7002 [puppet] - 10https://gerrit.wikimedia.org/r/293052 (https://phabricator.wikimedia.org/T136406) (owner: 10Alexandros Kosiaris) [23:05:49] aude, you've listed a wrong patch [23:06:03] we are on wmf.3 of wikidata / wikibase [23:06:09] it should apply to wmf.4 core [23:06:28] wmf.3 where? 
[23:06:35] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: puppet fail [23:06:35] PROBLEM - puppet last run on kraz is CRITICAL: CRITICAL: puppet fail [23:06:36] wikidata extension [23:06:39] https://noc.wikimedia.org/conf/highlight.php?file=wikiversions.json [23:06:49] i know [23:06:56] PROBLEM - puppet last run on wtp2013 is CRITICAL: CRITICAL: puppet fail [23:07:05] it's only the extension submodule branch [23:07:05] PROBLEM - puppet last run on db2012 is CRITICAL: CRITICAL: puppet fail [23:07:06] PROBLEM - puppet last run on db1061 is CRITICAL: CRITICAL: puppet fail [23:07:15] PROBLEM - puppet last run on elastic2018 is CRITICAL: CRITICAL: puppet fail [23:07:15] PROBLEM - puppet last run on mw2154 is CRITICAL: CRITICAL: puppet fail [23:07:16] PROBLEM - puppet last run on mw1016 is CRITICAL: CRITICAL: puppet fail [23:07:24] PROBLEM - puppet last run on mw1160 is CRITICAL: CRITICAL: puppet fail [23:07:25] PROBLEM - puppet last run on ms-fe2002 is CRITICAL: CRITICAL: puppet fail [23:07:25] PROBLEM - puppet last run on restbase1011 is CRITICAL: CRITICAL: puppet fail [23:07:25] PROBLEM - puppet last run on es2011 is CRITICAL: CRITICAL: puppet fail [23:07:26] PROBLEM - puppet last run on lvs1010 is CRITICAL: CRITICAL: puppet fail [23:07:33] if i merge that, it will appear in wmf.3, not wmf.4 [23:07:34] PROBLEM - puppet last run on mw2115 is CRITICAL: CRITICAL: puppet fail [23:07:34] PROBLEM - puppet last run on mw2148 is CRITICAL: CRITICAL: puppet fail [23:07:35] PROBLEM - puppet last run on cp1044 is CRITICAL: CRITICAL: puppet fail [23:07:35] PROBLEM - puppet last run on xenon is CRITICAL: CRITICAL: puppet fail [23:07:35] PROBLEM - puppet last run on db2053 is CRITICAL: CRITICAL: puppet fail [23:07:44] PROBLEM - puppet last run on labstore1004 is CRITICAL: CRITICAL: puppet fail [23:07:44] ignore these ^ [23:07:44] PROBLEM - puppet last run on mw2180 is CRITICAL: CRITICAL: puppet fail [23:07:45] PROBLEM - puppet last run on mw2185 is CRITICAL: 
CRITICAL: puppet fail [23:07:45] PROBLEM - puppet last run on oxygen is CRITICAL: CRITICAL: puppet fail [23:07:46] PROBLEM - puppet last run on mw1224 is CRITICAL: CRITICAL: puppet fail [23:07:51] (= it will take a manual commit) [23:07:52] (03PS1) 10Alexandros Kosiaris: admin: One more try to get it right [puppet] - 10https://gerrit.wikimedia.org/r/293053 [23:07:54] PROBLEM - puppet last run on hassaleh is CRITICAL: CRITICAL: puppet fail [23:07:54] PROBLEM - puppet last run on elastic2015 is CRITICAL: CRITICAL: puppet fail [23:07:55] PROBLEM - puppet last run on cp1052 is CRITICAL: CRITICAL: puppet fail [23:07:55] PROBLEM - puppet last run on db1038 is CRITICAL: CRITICAL: puppet fail [23:08:00] MaxSem: if it's not automatic in wmf.4, then can take care of it [23:08:19] aude, go ahead [23:08:24] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] admin: One more try to get it right [puppet] - 10https://gerrit.wikimedia.org/r/293053 (owner: 10Alexandros Kosiaris) [23:09:02] but it sounds something's wrong: reverting an extension should be done differently, by resetting wmf.4 to wmf.3 [23:09:16] ok [23:09:40] MaxSem: we just don't make a new wikidata branch every week [23:09:49] :OOOOO [23:09:54] it's special :) [23:10:04] someday it would be nice for it to be handled normally [23:10:25] please deploy yourself, I'm not too confident [23:10:36] ok [23:10:58] if i give it -2, will it not merge? [23:11:04] then i can take care of it after i eat? [23:11:18] yup. or remove my +2 [23:11:20] ok [23:11:24] 06Operations, 10Ops-Access-Requests: New SSH key for AWight - https://phabricator.wikimedia.org/T137162#2359476 (10Krenair) >>! In T137162#2359489, @awight wrote: > /me wonders how extra subscribers were added when this had "access requests" security Access requests doesn't make your request private, it's sup... 
[23:11:36] (03PS3) 10MaxSem: T131526 A/B test on wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292992 (https://phabricator.wikimedia.org/T131526) (owner: 10Jdrewniak) [23:11:40] Amir1: you should be able to proceed now [23:11:43] (03CR) 10MaxSem: [C: 032] T131526 A/B test on wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292992 (https://phabricator.wikimedia.org/T131526) (owner: 10Jdrewniak) [23:11:52] o/ [23:12:02] ok [23:12:03] akosiaris: thanks :) [23:12:06] (03CR) 10Luke081515: [C: 031] User rights configuration for meta. wmf-supportsafety group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292518 (https://phabricator.wikimedia.org/T136864) (owner: 10Dereckson) [23:12:37] (03Merged) 10jenkins-bot: T131526 A/B test on wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292992 (https://phabricator.wikimedia.org/T131526) (owner: 10Jdrewniak) [23:13:51] !log maxsem@tin Synchronized portals/prod/wikipedia.org/assets: https://gerrit.wikimedia.org/r/#/c/292992/ (duration: 00m 30s) [23:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:14:23] !log maxsem@tin Synchronized portals: https://gerrit.wikimedia.org/r/#/c/292992/ (duration: 00m 31s) [23:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:14:34] jan_drewniak, ^ [23:15:25] MaxSem: looks good, thanks! 
[23:16:00] MaxSem, poke me when you are ready to sync [23:16:13] (03PS2) 10MaxSem: Enable $wgGeoDataDebug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293037 [23:16:21] (03CR) 10MaxSem: [C: 032] Enable $wgGeoDataDebug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293037 (owner: 10MaxSem) [23:17:07] (03Merged) 10jenkins-bot: Enable $wgGeoDataDebug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293037 (owner: 10MaxSem) [23:17:54] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/293037/ (duration: 00m 24s) [23:17:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:20:46] MaxSem: confirmed 4xx dropoff, apple-app-site-association thing had the right real-world effect [23:21:42] !log deploying ae71d84 into ores in prod [23:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:22:25] yurik, whatcha need deployed? [23:22:25] akosiaris: 23:21:53 ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'ores/deploy', '-g', 'worker', 'fetch'] on scb1001.eqiad.wmnet returned [255]: Agent admitted failure to sign using the key. [23:22:25] Permission denied (publickey,keyboard-interactive). [23:22:49] ladsgroup@tin:/srv/deployment/ores/deploy$ scap deploy -v [23:22:50] MaxSem, i need to regen a password for zero. Give me a sec, [23:24:37] Amir1: ah the deploy-service issue... 
scap3 can be complicated...fixing [23:25:20] thanks :) [23:26:22] (03PS1) 10Alexandros Kosiaris: admin: Add ladsgroup/halfak to deploy-service group [puppet] - 10https://gerrit.wikimedia.org/r/293056 (https://phabricator.wikimedia.org/T136406) [23:27:01] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] admin: Add ladsgroup/halfak to deploy-service group [puppet] - 10https://gerrit.wikimedia.org/r/293056 (https://phabricator.wikimedia.org/T136406) (owner: 10Alexandros Kosiaris) [23:27:47] akosiaris: thanks, we need to wait until puppetmasters catch up [23:28:03] MaxSem, done, /srv/mediawiki-staging/private/PrivateSettings.php [23:28:33] 06Operations, 10Traffic: Scripts depending on varnishlog.py maxing out CPU usage on cache_misc - https://phabricator.wikimedia.org/T137114#2357668 (10BBlack) Which hosts were exhibiting this? The first one I looked at (cp4001) seems normal. [23:28:54] MaxSem, just commited it to the local git [23:29:28] !log maxsem@tin Synchronized private/PrivateSettings.php: Updated Zero password (duration: 00m 25s) [23:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:29:38] yurik, ^ [23:29:51] MaxSem, thx :) [23:30:04] watching logstash... [23:30:39] Amir1: fixed as well [23:31:20] yay [23:31:24] let's deploy [23:31:58] !log maxsem@tin Synchronized php-1.28.0-wmf.4/extensions/GeoData/: (no message) (duration: 00m 25s) [23:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:32:09] MaxSem, ^^^ [23:32:13] wait, oh... 
[23:33:08] RECOVERY - puppet last run on db2030 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [23:33:28] RECOVERY - puppet last run on lvs1010 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [23:33:38] RECOVERY - puppet last run on db2053 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [23:33:38] RECOVERY - puppet last run on labstore1004 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [23:36:59] tgr, ready? [23:37:07] MaxSem: yes [23:37:16] Operations, Traffic: Scripts depending on varnishlog.py maxing out CPU usage on cache_misc - https://phabricator.wikimedia.org/T137114#2359694 (BBlack) I found a few. It seems to be all the esams hosts, plus two of the eqiad hosts ( cp1051, cp1061 ). [23:37:40] (PS6) MaxSem: Apply AbuseFilter configuration syntax change [mediawiki-config] - https://gerrit.wikimedia.org/r/292758 (owner: Gergő Tisza) [23:37:48] (CR) MaxSem: [C: 2] Apply AbuseFilter configuration syntax change [mediawiki-config] - https://gerrit.wikimedia.org/r/292758 (owner: Gergő Tisza) [23:38:10] (PS1) Faidon Liambotis: exim: add wmflabs.org to wikimedia_domains [puppet] - https://gerrit.wikimedia.org/r/293057 [23:38:27] (Merged) jenkins-bot: Apply AbuseFilter configuration syntax change [mediawiki-config] - https://gerrit.wikimedia.org/r/292758 (owner: Gergő Tisza) [23:39:36] tgr, please tell me when I should sync/revert [23:39:44] (CR) Faidon Liambotis: [C: 2] exim: add wmflabs.org to wikimedia_domains [puppet] - https://gerrit.wikimedia.org/r/293057 (owner: Faidon Liambotis) [23:39:58] MaxSem: looks OK, please sync [23:41:10] !log maxsem@tin Synchronized wmf-config/abusefilter.php: https://gerrit.wikimedia.org/r/#/c/292758/ (duration: 00m 24s) [23:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:41:19] tgr, ^ [23:42:55] akosiaris: it seems 
codfw nodes can't connect to the tin (they fail) [23:43:04] 23:41:06 ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'ores/deploy', '-g', 'worker', 'promote'] on scb2002.codfw.wmnet returned [70]: 23:40:52 INFO - Starting new HTTP connection (1): deployment.eqiad.wmnet [23:44:09] Amir1: niah, that's a red herring. It's the sudo /usr/sbin/service ores restart command failing [23:44:26] oh [23:44:59] MaxSem: looks good as well, thanks! [23:45:15] aude, all yours when you return [23:45:16] due to ores still being called uwsgi-ores that is. Should be fixed once I merge https://gerrit.wikimedia.org/r/#/c/291751/ [23:45:52] hmm [23:45:54] okay [23:48:58] Amir1: https://ores.wikimedia.org/v2/scores/enwiki/?models=damaging&revids=724030089 [23:48:59] here we go [23:49:16] yessss [23:49:18] yesss [23:49:20] yess [23:49:24] yes [23:49:26] ye [23:49:28] y [23:49:34] halfak: ^ [23:49:42] * Amir1 is afk for dancing [23:50:27] Operations, Ops-Access-Requests, Patch-For-Review: Requesting deployment access (for deploying to scb) for Ladsgroup - https://phabricator.wikimedia.org/T136406#2359750 (akosiaris) Open>Resolved Above patches solved the problem. Re-resolving [23:52:57] PROBLEM - ores on scb1001 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 104 bytes in 0.004 second response time [23:53:37] PROBLEM - ores on scb2002 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 104 bytes in 0.067 second response time [23:54:22] I think that was us ^ [23:54:23] and this one as well [23:54:28] yup [23:54:47] I think will be fixed by the same commit [23:54:53] it's an erroneous check [23:55:00] checks a unit file that does not yet exist [23:55:26] akosiaris: it's 3AM! [23:55:32] oh, no that does not sound right [23:55:46] ori: yes, it's obvious in my problematic debugging [23:56:02] * yuvipanda calls ori kettle [23:56:59] it's 4:26 AM here [23:57:01] :D