[00:03:15] !log krenair Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/228198/ (duration: 00m 12s)
[00:03:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:15:38] 36
[00:15:46] bah, sorry
[00:18:37] RECOVERY - puppet last run on mw1047 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:29:27] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds
[01:30:29] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 6 below the confidence bounds
[01:47:55] !log starting OSC gerrit 228756 s5 wb_items_per_site.ips_site_page
[01:48:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:02:59] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected
[02:20:00] !log l10nupdate Synchronized php-1.26wmf16/cache/l10n: (no message) (duration: 06m 21s)
[02:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:23:21] !log @tin LocalisationUpdate completed (1.26wmf16) at 2015-08-03 02:23:21+00:00
[02:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:45:37] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 11 data above and 7 below the confidence bounds
[04:02:17] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL 10.71% of data above the critical threshold [100000000.0]
[04:17:08] 100000000.0
[04:17:25] Reedy: that's over 9,000, right?
[04:22:57] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected
[04:27:37] (Abandoned) KartikMistry: Beta: Add cxserver::restbase URL [puppet] - https://gerrit.wikimedia.org/r/222247 (owner: KartikMistry)
[04:46:57] RECOVERY - Incoming network saturation on labstore1003 is OK Less than 10.00% above the threshold [75000000.0]
[05:22:15] !log @tin ResourceLoader cache refresh completed at Mon Aug 3 05:22:15 UTC 2015 (duration 22m 14s)
[05:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:47:50] (CR) Giuseppe Lavagetto: "recheck" [software/conftool] - https://gerrit.wikimedia.org/r/228015 (https://phabricator.wikimedia.org/T107286) (owner: Giuseppe Lavagetto)
[05:48:40] (PS2) Giuseppe Lavagetto: Add mobileapps LVS IP [dns] - https://gerrit.wikimedia.org/r/227724 (https://phabricator.wikimedia.org/T92627)
[05:50:28] (CR) Giuseppe Lavagetto: [C: 2] Add mobileapps LVS IP [dns] - https://gerrit.wikimedia.org/r/227724 (https://phabricator.wikimedia.org/T92627) (owner: Giuseppe Lavagetto)
[06:15:06] PROBLEM - salt-minion processes on analytics1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:15:06] PROBLEM - Hadoop DataNode on analytics1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:15:28] PROBLEM - dhclient process on analytics1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:18:43] (PS2) Giuseppe Lavagetto: conftool: add write-locks to syncer and confctl [software/conftool] - https://gerrit.wikimedia.org/r/228015 (https://phabricator.wikimedia.org/T107286)
[06:22:37] (PS1) Muehlenhoff: Add reference for previously fixed security issue [debs/linux] - https://gerrit.wikimedia.org/r/228779
[06:23:00] (CR) Muehlenhoff: [C: 2 V: 2] Add reference for previously fixed security issue [debs/linux] - https://gerrit.wikimedia.org/r/228779 (owner: Muehlenhoff)
[06:23:32] (CR) Giuseppe Lavagetto: [C: 2] conftool: add write-locks to syncer and confctl [software/conftool] - https://gerrit.wikimedia.org/r/228015 (https://phabricator.wikimedia.org/T107286) (owner: Giuseppe Lavagetto)
[06:23:48] (Merged) jenkins-bot: conftool: add write-locks to syncer and confctl [software/conftool] - https://gerrit.wikimedia.org/r/228015 (https://phabricator.wikimedia.org/T107286) (owner: Giuseppe Lavagetto)
[06:30:26] PROBLEM - puppet last run on cp3048 is CRITICAL puppet fail
[06:31:27] PROBLEM - puppet last run on wtp2017 is CRITICAL Puppet has 1 failures
[06:31:36] PROBLEM - puppet last run on holmium is CRITICAL Puppet has 1 failures
[06:31:47] PROBLEM - puppet last run on mw2016 is CRITICAL Puppet has 1 failures
[06:31:48] PROBLEM - puppet last run on mc2015 is CRITICAL Puppet has 2 failures
[06:31:56] PROBLEM - puppet last run on mw1008 is CRITICAL Puppet has 1 failures
[06:32:06] PROBLEM - puppet last run on mw1060 is CRITICAL Puppet has 1 failures
[06:32:07] PROBLEM - puppet last run on wtp2008 is CRITICAL Puppet has 1 failures
[06:32:17] PROBLEM - puppet last run on db1003 is CRITICAL Puppet has 1 failures
[06:32:57] PROBLEM - puppet last run on mw1228 is CRITICAL Puppet has 1 failures
[06:33:08] PROBLEM - puppet last run on lvs2002 is CRITICAL Puppet has 1 failures
[06:46:47] (PS1) Muehlenhoff: Add ferm rules for role::mariadb::misc::phabricator [puppet] - https://gerrit.wikimedia.org/r/228782 (https://phabricator.wikimedia.org/T104699)
[06:48:16] PROBLEM - puppet last run on iodine is CRITICAL Puppet has 1 failures
[06:54:56] RECOVERY - puppet last run on mw1228 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:55:48] RECOVERY - puppet last run on mw2016 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures
[06:55:53] (PS4) Muehlenhoff: Add ferm rules for puppet master backends [puppet] - https://gerrit.wikimedia.org/r/226501
[06:55:56] RECOVERY - puppet last run on mc2015 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:55:57] RECOVERY - puppet last run on mw1008 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:56:06] (CR) Muehlenhoff: [C: 2 V: 2] Add ferm rules for puppet master backends [puppet] - https://gerrit.wikimedia.org/r/226501 (owner: Muehlenhoff)
[06:56:08] RECOVERY - puppet last run on mw1060 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures
[06:56:09] RECOVERY - puppet last run on wtp2008 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:57:07] RECOVERY - puppet last run on lvs2002 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:57:10] (PS1) Muehlenhoff: Add ferm rules for role::mariadb::misc [puppet] - https://gerrit.wikimedia.org/r/228783 (https://phabricator.wikimedia.org/T104699)
[06:57:36] RECOVERY - puppet last run on wtp2017 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:37] RECOVERY - puppet last run on holmium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:17] RECOVERY - puppet last run on db1003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:36] RECOVERY - puppet last run on cp3048 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:56] (PS1) Muehlenhoff: Enable ferm for puppetmaster backends [puppet] - https://gerrit.wikimedia.org/r/228784
[07:14:28] RECOVERY - puppet last run on iodine is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:26:47] operations: Puppet catalog compiler is broken - https://phabricator.wikimedia.org/T96802#1502386 (Joe) p:Low>High
[07:49:12] operations, Discovery, Traffic, Wikidata, and 2 others: Set up a public interface to the wikidata query service - https://phabricator.wikimedia.org/T107602#1502396 (Joe) Stripping cookies at the varnish layer is possible, but not advisable in general IMO.
[07:49:23] operations, Discovery, Traffic, Wikidata, and 2 others: Set up a public interface to the wikidata query service - https://phabricator.wikimedia.org/T107602#1502397 (Joe) a:Joe
[07:52:29] operations, Discovery, Wikidata, Wikidata-Query-Service: Assign an LVS service to the wikidata query service - https://phabricator.wikimedia.org/T107601#1502401 (Joe) @Smalyshev the load balancer will depool one server if it goes down, after a very short interval. We're working on a tool to pool/...
[08:38:56] (PS2) Muehlenhoff: ferm rules for nutcracker [puppet] - https://gerrit.wikimedia.org/r/223844 (https://phabricator.wikimedia.org/T104970) (owner: Dzahn)
[08:44:55] moritzm: I think we've switched nutcracker to a unix socket nowadays
[08:46:40] and in any case, the ferm rules probably don't belong in the nutcracker class, as one can supply a config defining different ports etc.
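paravoid's point above is that nutcracker's listen address lives in its config, not in the puppet class, so port-based ferm rules can't be derived from the class alone. A minimal nutcracker (twemproxy) config sketch illustrating this; pool names, addresses, and ports are hypothetical, not taken from the log:

```yaml
# Hypothetical twemproxy pool definitions. The listen address is
# per-pool, per-config: one pool uses a unix socket (no firewall rule
# needed at all), the other a TCP port (would need its own ferm rule).
memcached-local:
  listen: /var/run/nutcracker/nutcracker.sock 0666   # unix domain socket
  hash: md5
  distribution: ketama
  auto_eject_hosts: false
  servers:
    - 10.64.0.10:11211:1

redis-local:
  listen: 127.0.0.1:6380       # TCP: the port is config-defined, not class-defined
  redis: true
  hash: md5
  distribution: ketama
  servers:
    - 10.64.0.11:6379:1
```

This is why a firewall rule hard-coded in the puppet class could silently diverge from what the daemon actually listens on.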
[08:51:34] paravoid: thanks, I'll double-check
[09:13:13] (PS3) Muehlenhoff: ferm rules for nutcracker [puppet] - https://gerrit.wikimedia.org/r/223844 (https://phabricator.wikimedia.org/T104970) (owner: Dzahn)
[09:27:43] (PS1) Jcrespo: Depool db1035 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/228787
[09:29:17] (CR) Jcrespo: [C: 2] Depool db1035 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/228787 (owner: Jcrespo)
[09:31:12] !log jynus Synchronized wmf-config/db-eqiad.php: depool db1035 for maintenance (duration: 00m 12s)
[09:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:40:46] PROBLEM - puppet last run on analytics1015 is CRITICAL puppet fail
[09:49:43] (CR) Lokal Profil: "Done (had forgotten to push the change)" [puppet] - https://gerrit.wikimedia.org/r/219800 (https://phabricator.wikimedia.org/T103087) (owner: Lokal Profil)
[09:51:25] not good, BIOS on db1035 is complaining about lack of power
[09:55:25] (PS1) Muehlenhoff: Add ferm rules for Hue server [puppet] - https://gerrit.wikimedia.org/r/228788 (https://phabricator.wikimedia.org/T83597)
[09:58:58] (PS7) Lokal Profil: Add DCAT-AP for Wikibase [puppet] - https://gerrit.wikimedia.org/r/219800 (https://phabricator.wikimedia.org/T103087)
[10:01:01] (CR) Lokal Profil: "Now fully synced again. In general I would say main development happens here, then I can mirror it to github as a stand-alone (for now at least)" [puppet] - https://gerrit.wikimedia.org/r/219800 (https://phabricator.wikimedia.org/T103087) (owner: Lokal Profil)
[10:02:04] (PS1) Muehlenhoff: Add ferm rules for Hive server [puppet] - https://gerrit.wikimedia.org/r/228791 (https://phabricator.wikimedia.org/T83597)
[10:07:35] (PS2) Muehlenhoff: Add ferm rules for Hive server/metastore [puppet] - https://gerrit.wikimedia.org/r/228791 (https://phabricator.wikimedia.org/T83597)
[10:10:16] (PS1) Muehlenhoff: Add ferm rules for Oozie HTTP interface [puppet] - https://gerrit.wikimedia.org/r/228792 (https://phabricator.wikimedia.org/T83597)
[10:21:59] operations, ops-eqiad: db1035 died - network or power problem - https://phabricator.wikimedia.org/T107746#1502590 (jcrespo) NEW
[10:29:47] PROBLEM - RAID on analytics1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:29:56] PROBLEM - DPKG on analytics1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:30:37] PROBLEM - DPKG on radon is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[10:30:57] PROBLEM - DPKG on eeden is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[10:31:06] PROBLEM - Disk space on analytics1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:31:06] PROBLEM - Disk space on Hadoop worker on analytics1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:31:27] PROBLEM - configured eth on analytics1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:32:19] operations: Puppet catalog compiler is broken - https://phabricator.wikimedia.org/T96802#1502597 (Joe) For the record, I'm working on a more stable version of this code that will be a fusion of what @akosiaris wrote as a shell script and what I wrote for the current compiler. Basically I'm substituting my bad...
[10:34:47] PROBLEM - puppet last run on eeden is CRITICAL Puppet has 1 failures
[10:39:37] PROBLEM - puppet last run on radon is CRITICAL Puppet has 1 failures
[10:45:07] RECOVERY - DPKG on radon is OK: All packages OK
[10:45:11] !log upgrading all AuthDNS servers to gdnsd 2.2.0
[10:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:45:26] RECOVERY - DPKG on eeden is OK: All packages OK
[10:45:48] RECOVERY - puppet last run on radon is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures
[10:47:06] RECOVERY - puppet last run on eeden is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures
[10:50:11] (Abandoned) Faidon Liambotis: Switch *.{wap,mobile}.wikipedia.org to wikipedia-lb [dns] - https://gerrit.wikimedia.org/r/98055 (owner: Faidon Liambotis)
[10:51:37] (CR) Jakob: [C: 1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/227466 (https://phabricator.wikimedia.org/T101235) (owner: WMDE-leszek)
[10:52:13] operations, Database: New s3 production cluster for mysql - https://phabricator.wikimedia.org/T106847#1502622 (jcrespo) p:Low>Normal db1035 died.
[10:53:20] (PS4) Faidon Liambotis: Switch config to the City variant of GeoIP2 [dns] - https://gerrit.wikimedia.org/r/228233
[10:53:21] (PS2) Faidon Liambotis: Kill only-primary-map, unused and redundant [dns] - https://gerrit.wikimedia.org/r/228236
[10:54:23] (CR) Faidon Liambotis: [C: 2] Kill only-primary-map, unused and redundant [dns] - https://gerrit.wikimedia.org/r/228236 (owner: Faidon Liambotis)
[10:54:38] (CR) Faidon Liambotis: [C: 2] Switch config to the City variant of GeoIP2 [dns] - https://gerrit.wikimedia.org/r/228233 (owner: Faidon Liambotis)
[10:56:01] !log switching GeoDNS to GeoIP2
[10:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:57:09] operations, Traffic: No IPv6 addresses on Wikimedia nameservers ns(0-2).wikimedia.org - https://phabricator.wikimedia.org/T81605#1502648 (faidon)
[10:57:11] operations, Traffic, Patch-For-Review: Upgrade prod DNS daemons to gdnsd 2.2.0 - https://phabricator.wikimedia.org/T98003#1502646 (faidon) Open>Resolved This is now done :)
[10:57:32] operations, Traffic: Upgrade prod DNS daemons to gdnsd 2.2.0 - https://phabricator.wikimedia.org/T98003#1502652 (faidon)
[10:57:48] (PS1) Jcrespo: Repool db1027 and db1015 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/228797
[10:58:46] (CR) Jcrespo: [C: 2] Repool db1027 and db1015 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/228797 (owner: Jcrespo)
[10:59:49] operations, Traffic: No IPv6 addresses on Wikimedia nameservers ns(0-2).wikimedia.org - https://phabricator.wikimedia.org/T81605#1502657 (faidon) We've now switched to both gdnsd 2.2.0 and GeoIP2, which comes with non-lite City IPv6 support. Next steps are evaluating somehow whether that support is suffic...
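For context on the GeoDNS switch logged above: gdnsd 2.2.0 added libmaxminddb support, letting its geoip plugin read GeoIP2 databases. A minimal sketch of what such a config might look like, assuming the stock geoip plugin syntax; the database path, datacenter names, and map contents here are hypothetical, not taken from Wikimedia's actual dns repo:

```
# Hypothetical gdnsd 2.2 geoip plugin config using a GeoIP2 City database.
plugins => {
    geoip => {
        maps => {
            generic-map => {
                # geoip2_db is the gdnsd 2.2+ key for MaxMind GeoIP2 (.mmdb) data
                geoip2_db => /usr/share/GeoIP/GeoLite2-City.mmdb,
                datacenters => [dc-a, dc-b],
                map => {
                    default => [dc-a, dc-b],
                    EU => [dc-b, dc-a],
                }
            }
        }
    }
}
```

The City variant (as opposed to Country) gives finer-grained resolution, which is what the "Switch config to the City variant of GeoIP2" change refers to.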
[11:01:05] !log jynus Synchronized wmf-config/db-eqiad.php: avoid db1044 SPOF by repooling db1027 and db1015 (duration: 00m 12s)
[11:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:06:35] (PS1) Faidon Liambotis: lvs: remove {bits,text,upload,mobile}svc lb IPs [puppet] - https://gerrit.wikimedia.org/r/228800
[11:06:38] (PS1) Faidon Liambotis: Remove {bits,text,upload,mobile}.svc.$site.wmnet [dns] - https://gerrit.wikimedia.org/r/228801
[12:21:54] <_joe_> !log bumped ganglia-monitor-aggregator on bast4001, the upstart script needs immediate fixing
[12:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:29:27] (PS1) Muehlenhoff: Add ferm rules for role::mariadb::core [puppet] - https://gerrit.wikimedia.org/r/228804 (https://phabricator.wikimedia.org/T104699)
[12:32:39] operations, Traffic: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#1502780 (BBlack) NEW a:BBlack
[12:33:05] (PS2) BBlack: tlsproxy: let nginx use keepalives to varnish [puppet] - https://gerrit.wikimedia.org/r/228564 (https://phabricator.wikimedia.org/T107749)
[12:34:04] (CR) BBlack: [C: 1] lvs: remove {bits,text,upload,mobile}svc lb IPs [puppet] - https://gerrit.wikimedia.org/r/228800 (owner: Faidon Liambotis)
[12:34:40] (CR) BBlack: [C: 1] Remove {bits,text,upload,mobile}.svc.$site.wmnet [dns] - https://gerrit.wikimedia.org/r/228801 (owner: Faidon Liambotis)
[12:41:12] (PS1) Giuseppe Lavagetto: ganglia-monitor-aggregator: fix upstart script [puppet] - https://gerrit.wikimedia.org/r/228805
[12:42:11] (PS1) Muehlenhoff: Add ferm rules for coredb classes [puppet] - https://gerrit.wikimedia.org/r/228806 (https://phabricator.wikimedia.org/T104699)
[12:45:17] PROBLEM - puppet last run on analytics1014 is CRITICAL puppet fail
[12:45:27] PROBLEM - puppet last run on db1029 is CRITICAL puppet fail
[12:45:47] PROBLEM - puppet last run on mw1032 is CRITICAL puppet fail
[12:45:56] PROBLEM - puppet last run on mc1004 is CRITICAL puppet fail
[12:45:57] PROBLEM - puppet last run on labsdb1004 is CRITICAL puppet fail
[12:45:57] PROBLEM - puppet last run on logstash1001 is CRITICAL puppet fail
[12:46:06] PROBLEM - puppet last run on cp2007 is CRITICAL puppet fail
[12:46:06] PROBLEM - puppet last run on db2070 is CRITICAL puppet fail
[12:46:06] PROBLEM - puppet last run on db2046 is CRITICAL puppet fail
[12:46:07] PROBLEM - puppet last run on mw2161 is CRITICAL puppet fail
[12:46:07] PROBLEM - puppet last run on mw2201 is CRITICAL puppet fail
[12:46:07] PROBLEM - puppet last run on mw2153 is CRITICAL puppet fail
[12:46:07] PROBLEM - puppet last run on mw2139 is CRITICAL puppet fail
[12:46:08] PROBLEM - puppet last run on cp2025 is CRITICAL puppet fail
[12:46:08] PROBLEM - puppet last run on mw2160 is CRITICAL puppet fail
[12:46:09] PROBLEM - puppet last run on wtp2002 is CRITICAL puppet fail
[12:46:09] PROBLEM - puppet last run on mw2170 is CRITICAL puppet fail
[12:46:16] PROBLEM - puppet last run on titanium is CRITICAL puppet fail
[12:46:17] PROBLEM - puppet last run on cp1049 is CRITICAL puppet fail
[12:46:17] PROBLEM - puppet last run on elastic1017 is CRITICAL puppet fail
[12:46:17] PROBLEM - puppet last run on dbstore1001 is CRITICAL puppet fail
[12:46:17] PROBLEM - puppet last run on mw2116 is CRITICAL puppet fail
[12:46:17] PROBLEM - puppet last run on mw1201 is CRITICAL puppet fail
[12:46:18] PROBLEM - puppet last run on mw2031 is CRITICAL puppet fail
[12:46:18] PROBLEM - puppet last run on db1064 is CRITICAL puppet fail
[12:46:26] PROBLEM - puppet last run on mc2012 is CRITICAL puppet fail
[12:46:26] PROBLEM - puppet last run on mw2009 is CRITICAL puppet fail
[12:46:26] PROBLEM - puppet last run on analytics1043 is CRITICAL puppet fail
[12:46:26] PROBLEM - puppet last run on mc2002 is CRITICAL puppet fail
[12:46:27] PROBLEM - puppet last run on ms-be2002 is CRITICAL puppet fail
[12:46:27] PROBLEM - puppet last run on rdb2002 is CRITICAL puppet fail
[12:46:27] PROBLEM - puppet last run on ms-be1018 is CRITICAL puppet fail
[12:46:27] PROBLEM - puppet last run on wtp1015 is CRITICAL puppet fail
[12:46:28] PROBLEM - puppet last run on rdb1003 is CRITICAL puppet fail
[12:46:36] PROBLEM - puppet last run on dbproxy1002 is CRITICAL puppet fail
[12:46:36] PROBLEM - puppet last run on mw1041 is CRITICAL puppet fail
[12:46:37] PROBLEM - puppet last run on db1044 is CRITICAL puppet fail
[12:46:37] PROBLEM - puppet last run on lvs1002 is CRITICAL puppet fail
[12:46:46] PROBLEM - puppet last run on helium is CRITICAL puppet fail
[12:46:47] PROBLEM - puppet last run on mw2199 is CRITICAL puppet fail
[12:46:47] PROBLEM - puppet last run on mw2188 is CRITICAL puppet fail
[12:46:47] PROBLEM - puppet last run on mw1024 is CRITICAL puppet fail
[12:46:48] PROBLEM - puppet last run on mw1225 is CRITICAL puppet fail
[12:46:48] PROBLEM - puppet last run on mw2164 is CRITICAL puppet fail
[12:46:56] PROBLEM - puppet last run on mw2105 is CRITICAL puppet fail
[12:46:56] PROBLEM - puppet last run on mw2100 is CRITICAL puppet fail
[12:46:56] PROBLEM - puppet last run on db2009 is CRITICAL puppet fail
[12:46:56] PROBLEM - puppet last run on mw2064 is CRITICAL puppet fail
[12:46:57] PROBLEM - puppet last run on mc2009 is CRITICAL puppet fail
[12:46:57] PROBLEM - puppet last run on ms-be2006 is CRITICAL puppet fail
[12:46:57] PROBLEM - puppet last run on ganeti2001 is CRITICAL puppet fail
[12:46:57] PROBLEM - puppet last run on mw1063 is CRITICAL puppet fail
[12:47:06] PROBLEM - puppet last run on mw1140 is CRITICAL puppet fail
[12:47:06] PROBLEM - puppet last run on wtp1007 is CRITICAL puppet fail
[12:47:06] PROBLEM - puppet last run on mw1106 is CRITICAL puppet fail
[12:47:07] PROBLEM - puppet last run on mw1007 is CRITICAL puppet fail
[12:47:07] PROBLEM - puppet last run on mw1254 is CRITICAL puppet fail
[12:47:17] PROBLEM - puppet last run on db1031 is CRITICAL puppet fail
[12:47:18] PROBLEM - puppet last run on mw1122 is CRITICAL puppet fail
[12:47:18] PROBLEM - puppet last run on mw1033 is CRITICAL puppet fail
[12:47:27] PROBLEM - puppet last run on analytics1048 is CRITICAL puppet fail
[12:47:27] PROBLEM - puppet last run on db1033 is CRITICAL puppet fail
[12:47:27] PROBLEM - puppet last run on fluorine is CRITICAL puppet fail
[12:47:27] PROBLEM - puppet last run on mw1069 is CRITICAL puppet fail
[12:47:36] PROBLEM - puppet last run on analytics1033 is CRITICAL puppet fail
[12:47:47] PROBLEM - puppet last run on potassium is CRITICAL puppet fail
[12:47:47] PROBLEM - puppet last run on ms-be1006 is CRITICAL puppet fail
[12:47:48] PROBLEM - puppet last run on mw1091 is CRITICAL puppet fail
[12:47:48] PROBLEM - puppet last run on calcium is CRITICAL puppet fail
[12:47:56] PROBLEM - puppet last run on mw1197 is CRITICAL puppet fail
[12:47:56] PROBLEM - puppet last run on ganeti1004 is CRITICAL puppet fail
[12:47:57] PROBLEM - puppet last run on mw1010 is CRITICAL puppet fail
[12:47:57] PROBLEM - puppet last run on cp2023 is CRITICAL puppet fail
[12:48:07] PROBLEM - puppet last run on db2059 is CRITICAL puppet fail
[12:48:08] PROBLEM - puppet last run on planet1001 is CRITICAL puppet fail
[12:48:08] PROBLEM - puppet last run on ms-fe2004 is CRITICAL puppet fail
[12:48:08] PROBLEM - puppet last run on wtp2005 is CRITICAL puppet fail
[12:48:08] PROBLEM - puppet last run on wtp2020 is CRITICAL puppet fail
[12:48:08] PROBLEM - puppet last run on db2039 is CRITICAL puppet fail
[12:48:08] PROBLEM - puppet last run on wtp2001 is CRITICAL puppet fail
[12:48:08] PROBLEM - puppet last run on mw2163 is CRITICAL puppet fail
[12:48:08] PROBLEM - puppet last run on iodine is CRITICAL puppet fail
[12:48:09] PROBLEM - puppet last run on mw1068 is CRITICAL puppet fail
[12:48:18] PROBLEM - puppet last run on db1050 is CRITICAL puppet fail
[12:48:19] PROBLEM - puppet last run on mw1173 is CRITICAL puppet fail
[12:48:19] PROBLEM - puppet last run on db1002 is CRITICAL puppet fail
[12:48:19] PROBLEM - puppet last run on mw2087 is CRITICAL puppet fail
[12:48:19] PROBLEM - puppet last run on elastic1004 is CRITICAL puppet fail
[12:48:20] PROBLEM - puppet last run on es2008 is CRITICAL puppet fail
[12:48:20] PROBLEM - puppet last run on mw2082 is CRITICAL puppet fail
[12:48:20] PROBLEM - puppet last run on mw1150 is CRITICAL puppet fail
[12:48:20] PROBLEM - puppet last run on mw2075 is CRITICAL puppet fail
[12:48:21] PROBLEM - puppet last run on mw2010 is CRITICAL puppet fail
[12:48:21] PROBLEM - puppet last run on mw2004 is CRITICAL puppet fail
[12:48:26] PROBLEM - puppet last run on mw2015 is CRITICAL puppet fail
[12:48:26] PROBLEM - puppet last run on mw2019 is CRITICAL puppet fail
[12:48:27] PROBLEM - puppet last run on mc2016 is CRITICAL puppet fail
[12:48:27] PROBLEM - puppet last run on mw2080 is CRITICAL puppet fail
[12:48:27] PROBLEM - puppet last run on ganeti2004 is CRITICAL puppet fail
[12:48:27] PROBLEM - puppet last run on ms-be2003 is CRITICAL puppet fail
[12:48:35] operations, Traffic, Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#1502803 (BBlack) Testing on an upload cache (cp1071, 48 worker processes) yields starkly different results: Before: ``` root@cp1071:~# netstat -an|grep '10.64.48.105:80[...
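BBlack's keepalive change above (gerrit 228564, T107749) concerns the nginx TLS terminator reusing backend connections to the local varnish instead of opening a new TCP connection per request. A minimal sketch of the standard nginx pattern for this; the addresses and the `keepalive` count are illustrative assumptions, not taken from the actual tlsproxy puppet module:

```nginx
# Sketch: HTTP/1.1 keepalive from an nginx TLS proxy to a local backend.
# Without the upstream block plus proxy_http_version 1.1 and a cleared
# Connection header, nginx falls back to HTTP/1.0 to the upstream and
# closes the backend connection after every request.
upstream varnish_frontend {
    server 127.0.0.1:80;
    keepalive 32;                      # idle connections cached per worker process
}

server {
    listen 443 ssl;
    location / {
        proxy_pass http://varnish_frontend;
        proxy_http_version 1.1;        # keepalive requires HTTP/1.1 upstream
        proxy_set_header Connection "";  # don't forward the client's Connection header
    }
}
```

Note the per-worker semantics of `keepalive`: with 48 worker processes (as on cp1071 in the test above), the total idle-connection cache is 48 times the configured value, which is consistent with the "starkly different" netstat counts reported on the task.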
[12:48:36] PROBLEM - puppet last run on elastic1008 is CRITICAL puppet fail
[12:48:36] PROBLEM - puppet last run on mw1142 is CRITICAL puppet fail
[12:48:37] PROBLEM - puppet last run on mw1143 is CRITICAL puppet fail
[12:48:37] PROBLEM - puppet last run on mw1046 is CRITICAL puppet fail
[12:48:37] PROBLEM - puppet last run on analytics1004 is CRITICAL puppet fail
[12:48:37] PROBLEM - puppet last run on einsteinium is CRITICAL puppet fail
[12:48:38] PROBLEM - puppet last run on analytics1003 is CRITICAL puppet fail
[12:48:47] PROBLEM - puppet last run on analytics1035 is CRITICAL puppet fail
[12:48:47] PROBLEM - puppet last run on db2045 is CRITICAL puppet fail
[12:48:47] PROBLEM - puppet last run on ms-be1015 is CRITICAL puppet fail
[12:48:47] PROBLEM - puppet last run on mw2196 is CRITICAL puppet fail
[12:48:56] PROBLEM - puppet last run on lvs2004 is CRITICAL puppet fail
[12:48:56] PROBLEM - puppet last run on mw2109 is CRITICAL puppet fail
[12:48:56] PROBLEM - puppet last run on mw2134 is CRITICAL puppet fail
[12:48:56] PROBLEM - puppet last run on achernar is CRITICAL puppet fail
[12:48:56] PROBLEM - puppet last run on mw2083 is CRITICAL puppet fail
[12:48:57] PROBLEM - puppet last run on db2054 is CRITICAL puppet fail
[12:48:57] PROBLEM - puppet last run on ms-fe1001 is CRITICAL puppet fail
[12:48:57] PROBLEM - puppet last run on lvs3003 is CRITICAL puppet fail
[12:49:06] PROBLEM - puppet last run on mw1107 is CRITICAL puppet fail
[12:49:06] PROBLEM - puppet last run on mw1166 is CRITICAL puppet fail
[12:49:06] PROBLEM - puppet last run on mw1153 is CRITICAL puppet fail
[12:49:07] PROBLEM - puppet last run on bast1001 is CRITICAL puppet fail
[12:49:07] PROBLEM - puppet last run on mw1131 is CRITICAL puppet fail
[12:49:17] PROBLEM - puppet last run on mw1066 is CRITICAL puppet fail
[12:49:17] PROBLEM - puppet last run on cp3021 is CRITICAL puppet fail
[12:49:17] PROBLEM - puppet last run on cp3040 is CRITICAL puppet fail
[12:49:17] PROBLEM - puppet last run on cp3018 is CRITICAL puppet fail
[12:49:18] PROBLEM - puppet last run on cp3010 is CRITICAL puppet fail
[12:49:18] PROBLEM - puppet last run on mw1235 is CRITICAL puppet fail
[12:49:18] PROBLEM - puppet last run on analytics1030 is CRITICAL puppet fail
[12:49:26] PROBLEM - puppet last run on elastic1027 is CRITICAL puppet fail
[12:49:26] PROBLEM - puppet last run on lvs1005 is CRITICAL puppet fail
[12:49:26] PROBLEM - puppet last run on cp4002 is CRITICAL puppet fail
[12:49:27] PROBLEM - puppet last run on mw1021 is CRITICAL puppet fail
[12:49:36] PROBLEM - puppet last run on cp3047 is CRITICAL puppet fail
[12:49:36] PROBLEM - puppet last run on cp3042 is CRITICAL puppet fail
[12:49:36] PROBLEM - puppet last run on copper is CRITICAL puppet fail
[12:49:36] PROBLEM - puppet last run on mw1204 is CRITICAL puppet fail
[12:49:36] PROBLEM - puppet last run on elastic1018 is CRITICAL puppet fail
[12:49:37] PROBLEM - puppet last run on ganeti1002 is CRITICAL puppet fail
[12:49:37] PROBLEM - puppet last run on mw1118 is CRITICAL puppet fail
[12:49:38] PROBLEM - puppet last run on mw1189 is CRITICAL puppet fail
[12:49:38] PROBLEM - puppet last run on mw1253 is CRITICAL puppet fail
[12:49:46] PROBLEM - puppet last run on db1051 is CRITICAL puppet fail
[12:49:47] PROBLEM - puppet last run on ms-fe1002 is CRITICAL puppet fail
[12:49:47] PROBLEM - puppet last run on mw1128 is CRITICAL puppet fail
[12:49:47] PROBLEM - puppet last run on mw1025 is CRITICAL puppet fail
[12:49:47] PROBLEM - puppet last run on labcontrol1001 is CRITICAL puppet fail
[12:49:48] PROBLEM - puppet last run on analytics1031 is CRITICAL puppet fail
[12:49:57] PROBLEM - puppet last run on mw1241 is CRITICAL puppet fail
[12:49:58] PROBLEM - puppet last run on db1066 is CRITICAL puppet fail
[12:50:00] PROBLEM - puppet last run on db1022 is CRITICAL puppet fail
[12:50:00] PROBLEM - puppet last run on mw1027 is CRITICAL puppet fail
[12:50:00] PROBLEM - puppet last run on mw1003 is CRITICAL puppet fail
[12:50:00] PROBLEM - puppet last run on palladium is CRITICAL puppet fail
[12:50:06] PROBLEM - puppet last run on mw1205 is CRITICAL puppet fail
[12:50:07] PROBLEM - puppet last run on mw1154 is CRITICAL puppet fail
[12:50:07] PROBLEM - puppet last run on cp2016 is CRITICAL puppet fail
[12:50:07] PROBLEM - puppet last run on cp2011 is CRITICAL puppet fail
[12:50:07] PROBLEM - puppet last run on etherpad1001 is CRITICAL puppet fail
[12:50:07] PROBLEM - puppet last run on cp2003 is CRITICAL puppet fail
[12:50:07] PROBLEM - puppet last run on wtp2016 is CRITICAL puppet fail
[12:50:08] PROBLEM - puppet last run on mw2143 is CRITICAL puppet fail
[12:50:08] PROBLEM - puppet last run on mw2176 is CRITICAL puppet fail
[12:50:09] PROBLEM - puppet last run on cp2026 is CRITICAL puppet fail
[12:50:09] PROBLEM - puppet last run on mw2142 is CRITICAL puppet fail
[12:50:10] PROBLEM - puppet last run on mw2090 is CRITICAL puppet fail
[12:50:16] PROBLEM - puppet last run on db1040 is CRITICAL puppet fail
[12:50:16] PROBLEM - puppet last run on mw1155 is CRITICAL puppet fail
[12:50:16] PROBLEM - puppet last run on db1034 is CRITICAL puppet fail
[12:50:17] PROBLEM - puppet last run on ruthenium is CRITICAL puppet fail
[12:50:17] PROBLEM - puppet last run on mw2092 is CRITICAL puppet fail
[12:50:18] PROBLEM - puppet last run on mw2114 is CRITICAL puppet fail
[12:50:18] PROBLEM - puppet last run on mw2110 is CRITICAL puppet fail
[12:50:18] PROBLEM - puppet last run on mw2113 is CRITICAL puppet fail
[12:50:18] PROBLEM - puppet last run on mw2079 is CRITICAL puppet fail
[12:50:18] PROBLEM - puppet last run on mw2093 is CRITICAL puppet fail
[12:50:36] PROBLEM - puppet last run on mw1092 is CRITICAL puppet fail
[12:50:37] PROBLEM - puppet last run on wtp1001 is CRITICAL puppet fail
[12:50:37] PROBLEM - puppet last run on dataset1001 is CRITICAL puppet fail
[12:50:37] PROBLEM - puppet last run on labstore1001 is CRITICAL puppet fail
[12:50:37] PROBLEM - puppet last run on tmh1001 is CRITICAL puppet fail
[12:50:37] PROBLEM - puppet last run on lvs1004 is CRITICAL puppet fail
[12:50:46] PROBLEM - puppet last run on mw1230 is CRITICAL puppet fail
[12:50:46] PROBLEM - puppet last run on labvirt1009 is CRITICAL puppet fail
[12:50:46] PROBLEM - puppet last run on mw1194 is CRITICAL puppet fail
[12:50:46] PROBLEM - puppet last run on cp2014 is CRITICAL puppet fail
[12:50:47] PROBLEM - puppet last run on db2065 is CRITICAL puppet fail
[12:50:47] PROBLEM - puppet last run on db1021 is CRITICAL puppet fail
[12:50:47] PROBLEM - puppet last run on mw2212 is CRITICAL puppet fail
[12:50:48] PROBLEM - puppet last run on mw2184 is CRITICAL puppet fail
[12:50:56] PROBLEM - puppet last run on mw2123 is CRITICAL puppet fail
[12:50:56] PROBLEM - puppet last run on mw2131 is CRITICAL puppet fail
[12:50:56] PROBLEM - puppet last run on mw1137 is CRITICAL puppet fail
[12:50:56] PROBLEM - puppet last run on cp1058 is CRITICAL puppet fail
[12:50:57] PROBLEM - puppet last run on mw1211 is CRITICAL puppet fail
[12:50:57] PROBLEM - puppet last run on mw1075 is CRITICAL puppet fail
[12:50:57] PROBLEM - puppet last run on mw2127 is CRITICAL puppet fail
[12:50:58] PROBLEM - puppet last run on es2010 is CRITICAL puppet fail
[12:50:58] PROBLEM - puppet last run on mw2056 is CRITICAL puppet fail
[12:51:06] PROBLEM - puppet last run on ms-be3002 is CRITICAL puppet fail
[12:51:06] PROBLEM - puppet last run on cp4019 is CRITICAL puppet fail
[12:51:07] PROBLEM - puppet last run on cp4015 is CRITICAL puppet fail
[12:51:07] PROBLEM - puppet last run on mw1206 is CRITICAL puppet fail
[12:51:07] PROBLEM - puppet last run on db1027 is CRITICAL puppet fail
[12:51:07] PROBLEM - puppet last run on dbproxy1001 is CRITICAL puppet fail
[12:51:16] PROBLEM - puppet last run on mw1208 is CRITICAL puppet fail
[12:51:18] PROBLEM - puppet last run on mw1047 is CRITICAL puppet fail
[12:51:26] PROBLEM - puppet last run on mw1078 is CRITICAL puppet fail
[12:51:26] PROBLEM - puppet last run on cp4005 is CRITICAL puppet fail
[12:51:27] PROBLEM - puppet last run on cp4014 is CRITICAL puppet fail
[12:51:36] PROBLEM - puppet last run on cp3030 is CRITICAL puppet fail
[12:51:36] PROBLEM - puppet last run on cp3006 is CRITICAL puppet fail
[12:51:36] PROBLEM - puppet last run on multatuli is CRITICAL puppet fail
[12:51:36] PROBLEM - puppet last run on elastic1013 is CRITICAL puppet fail
[12:51:36] PROBLEM - puppet last run on db1042 is CRITICAL puppet fail
[12:51:37] PROBLEM - puppet last run on mw1011 is CRITICAL puppet fail
[12:51:46] PROBLEM - puppet last run on db1005 is CRITICAL puppet fail
[12:51:46] PROBLEM - puppet last run on erbium is CRITICAL puppet fail
[12:51:47] PROBLEM - puppet last run on restbase1009 is CRITICAL puppet fail
[12:51:47] PROBLEM - puppet last run on es1004 is CRITICAL puppet fail
[12:51:47] PROBLEM - puppet last run on polonium is CRITICAL puppet fail
[12:51:47] PROBLEM - puppet last run on db1009 is CRITICAL puppet fail
[12:51:48] PROBLEM - puppet last run on sodium is CRITICAL puppet fail
[12:51:48] PROBLEM - puppet last run on db1068 is CRITICAL puppet fail
[12:51:48] PROBLEM - puppet last run on labsdb1005 is CRITICAL puppet fail
[12:51:57] PROBLEM - puppet last run on mw1054 is CRITICAL puppet fail
[12:51:57] PROBLEM - puppet last run on logstash1002 is CRITICAL puppet fail
[12:52:07] PROBLEM - puppet last run on db2038 is CRITICAL puppet fail
[12:52:07] PROBLEM - puppet last run on cp2010 is CRITICAL puppet fail
[12:52:07] PROBLEM - puppet last run on ms-fe2003 is CRITICAL puppet fail
[12:52:07] PROBLEM - puppet last run on mw2182 is CRITICAL puppet fail
[12:52:08] PROBLEM - puppet last run on wtp2010 is CRITICAL puppet fail
[12:52:08] PROBLEM - puppet last run on db1058 is CRITICAL puppet fail
[12:52:08] PROBLEM - puppet last run on mw2203 is CRITICAL puppet fail
[12:52:09] PROBLEM - puppet last run on db2047 is CRITICAL puppet fail
[12:52:09] PROBLEM - puppet last run on mw2174 is CRITICAL puppet fail
[12:52:10] PROBLEM - puppet last run on 
mw2200 is CRITICAL puppet fail [12:52:10] PROBLEM - puppet last run on mw2168 is CRITICAL puppet fail [12:52:11] PROBLEM - puppet last run on antimony is CRITICAL puppet fail [12:52:16] !log stop->wait->restart of apache2 service on palladium (seemed dead to puppet reqs) [12:52:16] PROBLEM - puppet last run on mw2039 is CRITICAL puppet fail [12:52:16] PROBLEM - puppet last run on mw1180 is CRITICAL puppet fail [12:52:17] PROBLEM - puppet last run on elastic1016 is CRITICAL puppet fail [12:52:17] PROBLEM - puppet last run on mw1213 is CRITICAL puppet fail [12:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:52:26] PROBLEM - puppet last run on mw2101 is CRITICAL puppet fail [12:52:27] PROBLEM - puppet last run on mw2085 is CRITICAL puppet fail [12:52:27] PROBLEM - puppet last run on mw2130 is CRITICAL puppet fail [12:52:27] PROBLEM - puppet last run on mw2111 is CRITICAL puppet fail [12:52:27] PROBLEM - puppet last run on mw2098 is CRITICAL puppet fail [12:52:27] PROBLEM - puppet last run on db2029 is CRITICAL puppet fail [12:52:27] PROBLEM - puppet last run on analytics1002 is CRITICAL puppet fail [12:52:28] PROBLEM - puppet last run on mira is CRITICAL puppet fail [12:52:28] PROBLEM - puppet last run on ms-be2009 is CRITICAL puppet fail [12:52:29] PROBLEM - puppet last run on mw2062 is CRITICAL puppet fail [12:52:29] PROBLEM - puppet last run on bromine is CRITICAL puppet fail [12:52:30] PROBLEM - puppet last run on db2007 is CRITICAL puppet fail [12:52:30] PROBLEM - puppet last run on db1024 is CRITICAL puppet fail [12:52:31] PROBLEM - puppet last run on db1016 is CRITICAL puppet fail [12:52:46] PROBLEM - puppet last run on mw1136 is CRITICAL puppet fail [12:52:46] PROBLEM - puppet last run on mw1169 is CRITICAL puppet fail [12:52:46] PROBLEM - puppet last run on mw1238 is CRITICAL puppet fail [12:52:47] PROBLEM - puppet last run on db2057 is CRITICAL puppet fail [12:52:47] PROBLEM - puppet last run on mc1012 is 
CRITICAL puppet fail [12:52:56] PROBLEM - puppet last run on mw1020 is CRITICAL puppet fail [12:52:57] PROBLEM - puppet last run on mw1179 is CRITICAL puppet fail [12:52:58] PROBLEM - puppet last run on db2023 is CRITICAL puppet fail [12:52:58] PROBLEM - puppet last run on ms-be2014 is CRITICAL puppet fail [12:52:58] PROBLEM - puppet last run on mc2014 is CRITICAL puppet fail [12:53:06] PROBLEM - puppet last run on ms-be2010 is CRITICAL puppet fail [12:53:06] PROBLEM - puppet last run on lvs4001 is CRITICAL puppet fail [12:53:07] PROBLEM - puppet last run on lvs3004 is CRITICAL puppet fail [12:53:07] PROBLEM - puppet last run on cp1099 is CRITICAL puppet fail [12:53:07] PROBLEM - puppet last run on ms-be3001 is CRITICAL puppet fail [12:53:17] PROBLEM - puppet last run on argon is CRITICAL puppet fail [12:53:17] PROBLEM - puppet last run on mw1049 is CRITICAL puppet fail [12:53:17] PROBLEM - puppet last run on cp1073 is CRITICAL puppet fail [12:53:26] PROBLEM - puppet last run on mw1138 is CRITICAL puppet fail [12:53:26] PROBLEM - puppet last run on analytics1029 is CRITICAL puppet fail [12:53:26] PROBLEM - puppet last run on mw1030 is CRITICAL puppet fail [12:53:27] PROBLEM - puppet last run on mw1083 is CRITICAL puppet fail [12:53:27] PROBLEM - puppet last run on cp3009 is CRITICAL puppet fail [12:53:27] PROBLEM - puppet last run on snapshot1002 is CRITICAL puppet fail [12:53:27] PROBLEM - puppet last run on ms-fe3002 is CRITICAL puppet fail [12:53:36] PROBLEM - puppet last run on cp4017 is CRITICAL puppet fail [12:53:37] PROBLEM - puppet last run on restbase1002 is CRITICAL puppet fail [12:53:37] PROBLEM - puppet last run on hafnium is CRITICAL puppet fail [12:53:37] PROBLEM - puppet last run on restbase1006 is CRITICAL puppet fail [12:53:47] PROBLEM - puppet last run on mw1084 is CRITICAL puppet fail [12:53:47] PROBLEM - puppet last run on analytics1022 is CRITICAL puppet fail [12:53:48] PROBLEM - puppet last run on cp1062 is CRITICAL puppet fail [12:53:48] 
PROBLEM - puppet last run on mw1181 is CRITICAL puppet fail [12:53:48] PROBLEM - puppet last run on mw1053 is CRITICAL puppet fail [12:53:56] PROBLEM - puppet last run on db1010 is CRITICAL puppet fail [12:53:56] PROBLEM - puppet last run on mc1008 is CRITICAL puppet fail [12:53:57] PROBLEM - puppet last run on analytics1019 is CRITICAL puppet fail [12:53:57] PROBLEM - puppet last run on ocg1002 is CRITICAL puppet fail [12:53:57] PROBLEM - puppet last run on stat1002 is CRITICAL puppet fail [12:54:06] PROBLEM - puppet last run on elastic1006 is CRITICAL puppet fail [12:54:07] PROBLEM - puppet last run on mw1165 is CRITICAL puppet fail [12:54:07] PROBLEM - puppet last run on db2041 is CRITICAL puppet fail [12:54:07] PROBLEM - puppet last run on cp2019 is CRITICAL puppet fail [12:54:08] PROBLEM - puppet last run on cp2015 is CRITICAL puppet fail [12:54:08] PROBLEM - puppet last run on db2037 is CRITICAL puppet fail [12:54:08] PROBLEM - puppet last run on mw2172 is CRITICAL puppet fail [12:54:08] PROBLEM - puppet last run on mw2150 is CRITICAL puppet fail [12:54:09] PROBLEM - puppet last run on cp2022 is CRITICAL puppet fail [12:54:16] PROBLEM - puppet last run on stat1001 is CRITICAL puppet fail [12:54:16] PROBLEM - puppet last run on mw2152 is CRITICAL puppet fail [12:54:17] PROBLEM - puppet last run on db1071 is CRITICAL puppet fail [12:54:17] PROBLEM - puppet last run on mw1132 is CRITICAL puppet fail [12:54:27] PROBLEM - puppet last run on mw2107 is CRITICAL puppet fail [12:54:27] PROBLEM - puppet last run on mw2133 is CRITICAL puppet fail [12:54:27] PROBLEM - puppet last run on mw2094 is CRITICAL puppet fail [12:54:27] PROBLEM - puppet last run on mw2027 is CRITICAL puppet fail [12:54:27] PROBLEM - puppet last run on mw2046 is CRITICAL puppet fail [12:54:27] PROBLEM - puppet last run on restbase1005 is CRITICAL puppet fail [12:54:28] PROBLEM - puppet last run on mw2053 is CRITICAL puppet fail [12:54:28] PROBLEM - puppet last run on db2004 is CRITICAL puppet fail 
[12:54:29] PROBLEM - puppet last run on mw2048 is CRITICAL puppet fail [12:54:37] PROBLEM - puppet last run on mw1074 is CRITICAL puppet fail [12:54:37] PROBLEM - puppet last run on ms-be2008 is CRITICAL puppet fail [12:54:38] PROBLEM - puppet last run on mw1240 is CRITICAL puppet fail [12:54:38] PROBLEM - puppet last run on es1009 is CRITICAL puppet fail [12:54:46] PROBLEM - puppet last run on ms-be1007 is CRITICAL puppet fail [12:54:46] PROBLEM - puppet last run on wtp1021 is CRITICAL puppet fail [12:54:47] PROBLEM - puppet last run on analytics1044 is CRITICAL puppet fail [12:54:47] PROBLEM - puppet last run on mw1218 is CRITICAL puppet fail [12:54:47] PROBLEM - puppet last run on db2061 is CRITICAL puppet fail [12:54:47] PROBLEM - puppet last run on mw1245 is CRITICAL puppet fail [12:54:56] PROBLEM - puppet last run on mw2202 is CRITICAL puppet fail [12:54:57] PROBLEM - puppet last run on lvs2003 is CRITICAL puppet fail [12:54:57] PROBLEM - puppet last run on mw2106 is CRITICAL puppet fail [12:54:58] PROBLEM - puppet last run on ms-be2013 is CRITICAL puppet fail [12:54:58] PROBLEM - puppet last run on db2001 is CRITICAL puppet fail [12:55:06] PROBLEM - puppet last run on db1062 is CRITICAL puppet fail [12:55:16] PROBLEM - puppet last run on mw1198 is CRITICAL puppet fail [12:55:16] PROBLEM - puppet last run on mw1040 is CRITICAL puppet fail [12:55:16] PROBLEM - puppet last run on db1055 is CRITICAL puppet fail [12:55:27] PROBLEM - puppet last run on cp3031 is CRITICAL puppet fail [12:55:27] PROBLEM - puppet last run on cp3013 is CRITICAL puppet fail [12:55:27] PROBLEM - puppet last run on cp3046 is CRITICAL puppet fail [12:55:27] PROBLEM - puppet last run on mw1134 is CRITICAL puppet fail [12:55:28] PROBLEM - puppet last run on analytics1032 is CRITICAL puppet fail [12:55:28] PROBLEM - puppet last run on mw1062 is CRITICAL puppet fail [12:55:36] PROBLEM - puppet last run on ms-be1009 is CRITICAL puppet fail [12:55:36] PROBLEM - puppet last run on lvs1001 is 
CRITICAL puppet fail [12:55:37] PROBLEM - puppet last run on wtp1002 is CRITICAL puppet fail [12:55:37] PROBLEM - puppet last run on wdqs1001 is CRITICAL puppet fail [12:55:37] PROBLEM - puppet last run on krypton is CRITICAL puppet fail [12:55:37] PROBLEM - puppet last run on protactinium is CRITICAL puppet fail [12:55:37] PROBLEM - puppet last run on mw1116 is CRITICAL puppet fail [12:55:47] PROBLEM - puppet last run on analytics1036 is CRITICAL puppet fail [12:55:47] PROBLEM - puppet last run on ms-fe1003 is CRITICAL puppet fail [12:55:47] PROBLEM - puppet last run on labvirt1005 is CRITICAL puppet fail [12:55:47] PROBLEM - puppet last run on labvirt1007 is CRITICAL puppet fail [12:55:47] PROBLEM - puppet last run on wdqs1002 is CRITICAL puppet fail [12:55:47] PROBLEM - puppet last run on labsdb1001 is CRITICAL puppet fail [12:55:57] PROBLEM - puppet last run on mc1011 is CRITICAL puppet fail [12:55:57] PROBLEM - puppet last run on restbase1003 is CRITICAL puppet fail [12:55:57] PROBLEM - puppet last run on db1063 is CRITICAL puppet fail [12:55:58] PROBLEM - puppet last run on cp1043 is CRITICAL puppet fail [12:56:06] PROBLEM - puppet last run on analytics1011 is CRITICAL puppet fail [12:56:07] PROBLEM - puppet last run on etcd1001 is CRITICAL puppet fail [12:56:07] PROBLEM - puppet last run on mw1031 is CRITICAL puppet fail [12:56:07] PROBLEM - puppet last run on cp2017 is CRITICAL puppet fail [12:56:17] PROBLEM - puppet last run on db2050 is CRITICAL puppet fail [12:56:17] PROBLEM - puppet last run on wtp2009 is CRITICAL puppet fail [12:56:17] PROBLEM - puppet last run on mw2193 is CRITICAL puppet fail [12:56:17] PROBLEM - puppet last run on mw2167 is CRITICAL puppet fail [12:56:17] PROBLEM - puppet last run on mw2186 is CRITICAL puppet fail [12:56:17] PROBLEM - puppet last run on ms-be1017 is CRITICAL puppet fail [12:56:17] PROBLEM - puppet last run on mw2209 is CRITICAL puppet fail [12:56:18] PROBLEM - puppet last run on ocg1001 is CRITICAL puppet fail 
[12:56:18] PROBLEM - puppet last run on wtp1013 is CRITICAL puppet fail [12:56:19] PROBLEM - puppet last run on mw1045 is CRITICAL puppet fail [12:56:19] PROBLEM - puppet last run on mw1256 is CRITICAL puppet fail [12:56:20] PROBLEM - puppet last run on mw1059 is CRITICAL puppet fail [12:56:36] PROBLEM - puppet last run on labcontrol1002 is CRITICAL puppet fail [12:56:36] PROBLEM - puppet last run on labstore2001 is CRITICAL puppet fail [12:56:36] PROBLEM - puppet last run on mw1022 is CRITICAL puppet fail [12:56:37] PROBLEM - puppet last run on lvs4002 is CRITICAL puppet fail [12:56:46] PROBLEM - puppet last run on ms-be1004 is CRITICAL puppet fail [12:56:47] PROBLEM - puppet last run on caesium is CRITICAL puppet fail [12:56:47] PROBLEM - puppet last run on mw1231 is CRITICAL puppet fail [12:56:56] PROBLEM - puppet last run on mw1072 is CRITICAL puppet fail [12:56:57] PROBLEM - puppet last run on mw1080 is CRITICAL puppet fail [12:56:57] PROBLEM - puppet last run on analytics1017 is CRITICAL puppet fail [12:56:57] PROBLEM - puppet last run on mw2144 is CRITICAL puppet fail [12:56:58] PROBLEM - puppet last run on mw1141 is CRITICAL puppet fail [12:57:06] PROBLEM - puppet last run on mw2032 is CRITICAL puppet fail [12:57:06] PROBLEM - puppet last run on mw2020 is CRITICAL puppet fail [12:57:06] PROBLEM - puppet last run on mw1178 is CRITICAL puppet fail [12:57:07] PROBLEM - puppet last run on ms-fe3001 is CRITICAL puppet fail [12:57:16] PROBLEM - puppet last run on uranium is CRITICAL puppet fail [12:57:16] PROBLEM - puppet last run on mw1001 is CRITICAL puppet fail [12:57:18] PROBLEM - puppet last run on mw1139 is CRITICAL puppet fail [12:57:26] PROBLEM - puppet last run on db1053 is CRITICAL puppet fail [12:57:27] PROBLEM - puppet last run on cp3019 is CRITICAL puppet fail [12:57:27] PROBLEM - puppet last run on cp3045 is CRITICAL puppet fail [12:57:27] PROBLEM - puppet last run on cp3039 is CRITICAL puppet fail [12:57:27] PROBLEM - puppet last run on elastic1001 
is CRITICAL puppet fail [12:57:27] PROBLEM - puppet last run on mc1018 is CRITICAL puppet fail [12:57:27] PROBLEM - puppet last run on maerlant is CRITICAL puppet fail [12:57:28] PROBLEM - puppet last run on cp1055 is CRITICAL puppet fail [12:57:28] PROBLEM - puppet last run on analytics1041 is CRITICAL puppet fail [12:57:37] PROBLEM - puppet last run on wtp1010 is CRITICAL puppet fail [12:57:37] PROBLEM - puppet last run on cp3012 is CRITICAL puppet fail [12:57:38] PROBLEM - puppet last run on mw1222 is CRITICAL puppet fail [12:57:38] PROBLEM - puppet last run on netmon1001 is CRITICAL puppet fail [12:57:38] PROBLEM - puppet last run on graphite1002 is CRITICAL puppet fail [12:57:46] PROBLEM - puppet last run on db1011 is CRITICAL puppet fail [12:57:48] PROBLEM - puppet last run on lead is CRITICAL puppet fail [12:57:48] PROBLEM - puppet last run on ms-be1002 is CRITICAL puppet fail [12:57:48] PROBLEM - puppet last run on mw1215 is CRITICAL puppet fail [12:57:56] PROBLEM - puppet last run on analytics1028 is CRITICAL puppet fail [12:57:57] PROBLEM - puppet last run on db1070 is CRITICAL puppet fail [12:57:57] PROBLEM - puppet last run on logstash1005 is CRITICAL puppet fail [12:57:57] PROBLEM - puppet last run on wtp1020 is CRITICAL puppet fail [12:57:57] PROBLEM - puppet last run on mw1090 is CRITICAL puppet fail [12:58:07] PROBLEM - puppet last run on mw1226 is CRITICAL puppet fail [12:58:08] PROBLEM - puppet last run on mw1009 is CRITICAL puppet fail [12:58:08] PROBLEM - puppet last run on db1073 is CRITICAL puppet fail [12:58:08] PROBLEM - puppet last run on analytics1020 is CRITICAL puppet fail [12:58:08] PROBLEM - puppet last run on mw1219 is CRITICAL puppet fail [12:58:17] PROBLEM - puppet last run on mw1060 is CRITICAL puppet fail [12:58:17] PROBLEM - puppet last run on mw2157 is CRITICAL puppet fail [12:58:17] PROBLEM - puppet last run on db2034 is CRITICAL puppet fail [12:58:18] PROBLEM - puppet last run on mw2141 is CRITICAL puppet fail [12:58:18] 
PROBLEM - puppet last run on mw2145 is CRITICAL puppet fail [12:58:18] PROBLEM - puppet last run on mw2081 is CRITICAL puppet fail [12:58:18] PROBLEM - puppet last run on terbium is CRITICAL puppet fail [12:58:18] PROBLEM - puppet last run on mw2076 is CRITICAL puppet fail [12:58:19] PROBLEM - puppet last run on sca1002 is CRITICAL puppet fail [12:58:19] PROBLEM - puppet last run on mc1002 is CRITICAL puppet fail [12:58:26] PROBLEM - puppet last run on ms-be1003 is CRITICAL puppet fail [12:58:27] PROBLEM - puppet last run on mw1093 is CRITICAL puppet fail [12:58:27] PROBLEM - puppet last run on mw1026 is CRITICAL puppet fail [12:58:27] PROBLEM - puppet last run on iridium is CRITICAL puppet fail [12:58:37] PROBLEM - puppet last run on es2009 is CRITICAL puppet fail [12:58:37] PROBLEM - puppet last run on mw2069 is CRITICAL puppet fail [12:58:37] PROBLEM - puppet last run on mw2021 is CRITICAL puppet fail [12:58:37] PROBLEM - puppet last run on wtp1006 is CRITICAL puppet fail [12:58:37] PROBLEM - puppet last run on es2002 is CRITICAL puppet fail [12:58:37] PROBLEM - puppet last run on db2002 is CRITICAL puppet fail [12:58:37] PROBLEM - puppet last run on mw2033 is CRITICAL puppet fail [12:58:38] PROBLEM - puppet last run on mw2036 is CRITICAL puppet fail [12:58:38] PROBLEM - puppet last run on mw2016 is CRITICAL puppet fail [12:58:39] PROBLEM - puppet last run on mc2004 is CRITICAL puppet fail [12:58:39] PROBLEM - puppet last run on mw2023 is CRITICAL puppet fail [12:58:40] PROBLEM - puppet last run on cp1054 is CRITICAL puppet fail [12:59:06] !log stopped ircecho + puppet-agent on neon (spam from epic puppetmaster fail) [12:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:59:20] 6operations: Make dumps run via cron on each snapshot host - https://phabricator.wikimedia.org/T107750#1502816 (10ArielGlenn) 3NEW a:3ArielGlenn [13:00:04] aude: Respected human, time to deploy Wikidata 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150803T1300). Please do the needful. [13:05:02] !log stopped puppet-agent + apache2 on strontium + palladium (no masters alive, for mysql maintenance) [13:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:10:13] !log rhodium too (puppetmaster stop) [13:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:17:42] !log re-enable agent, restarted apache2 on palladium, strontium, rhodium (fact_values truncated in mysql) [13:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:35:11] (03PS1) 10Aude: Enable usage tracking on svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228808 [13:40:29] 7Puppet, 6operations, 7Database: Puppet failure on all hosts with Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Mysql::Error: Out of range value for column 'id' at row 1: INSERT INTO `fact_values` (`updated_at`, `host_id`, `creat... - https://phabricator.wikimedia.org/T107753#1502869 [13:41:39] ^bblack _joe_ I will assign it to me and close it if I do not think it will happen often [13:41:54] <_joe_> jynus: nod [13:42:32] 7Puppet, 6operations, 7Database: Puppet failure on all hosts with Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Mysql::Error: Out of range value for column 'id' at row 1: INSERT INTO `fact_values` (`updated_at`, `host_id`, `creat... - https://phabricator.wikimedia.org/T107753#1502880 [13:42:48] 6operations: Make dumps run via cron on each snapshot host - https://phabricator.wikimedia.org/T107750#1502881 (10ArielGlenn) Things that need to be done for this to happen: make sure we can skip some jobs on a particular run and still call the dump 'complete' (lets us run dumps without full content for all rev... 
[13:43:22] probably the important thing here is to create a monitoring service for all auto_increments on all masters [13:43:54] (03CR) 10BBlack: [C: 032] "Testing on live text+upload seems ok, and it actually seems to be a probable improvement on upload user-facing perf rather than a downside" [puppet] - 10https://gerrit.wikimedia.org/r/228564 (https://phabricator.wikimedia.org/T107749) (owner: 10BBlack) [13:50:18] !log re-enabling puppet + ircecho on neon (vast majority of recovery spam is over with) [13:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:51:37] PROBLEM - puppet last run on neon is CRITICAL puppet fail [13:53:37] <_joe_> keep an eye on neon's puppet [13:53:46] that's just from me having it disabled forever [13:53:52] I'm re-running it now to clear the alert [13:53:52] <_joe_> it's the first candidate to fail if something is screwed in the db [13:55:38] zero icinga config changes from that run, so all good :) [13:55:48] <_joe_> yes [13:55:57] RECOVERY - puppet last run on neon is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures [13:55:57] (03PS1) 10ArielGlenn: schedule stages of dumps to run in order on a given host [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/228809 [14:00:25] 6operations, 7Monitoring, 5Patch-For-Review: remove ganglia(old), replace with ganglia_new - https://phabricator.wikimedia.org/T93776#1502905 (10Dzahn) and since all ULSFO was done, Giuseppe has merged my change to also switch ULSFO to default: https://gerrit.wikimedia.org/r/#/c/225276/ and could then ren...
[14:00:33] 6operations, 7Monitoring, 5Patch-For-Review: remove ganglia(old), replace with ganglia_new - https://phabricator.wikimedia.org/T93776#1502907 (10Dzahn) 5Open>3Resolved [14:01:25] 6operations, 7Easy: Update people.wikimedia.org with the 2015 Wikimania group photo - https://phabricator.wikimedia.org/T106598#1502910 (10Dzahn) a:3Dzahn [14:03:30] (03PS1) 10BBlack: enable upload ipsec for eqiad + cp3034 for upload-reload testing [puppet] - 10https://gerrit.wikimedia.org/r/228811 (https://phabricator.wikimedia.org/T92604) [14:05:49] 6operations, 7HTTPS: download.wikipedia.org is using an invalid certificate - https://phabricator.wikimedia.org/T107575#1502936 (10Dzahn) It was never done because there were concerns by @apergos about putting large (huge, GBs) files behind varnish. [14:06:50] (03CR) 10BBlack: [C: 032] enable upload ipsec for eqiad + cp3034 for upload-reload testing [puppet] - 10https://gerrit.wikimedia.org/r/228811 (https://phabricator.wikimedia.org/T92604) (owner: 10BBlack) [14:07:44] 6operations, 6Performance-Team, 7Mobile, 3Reading-Web: Remove docroot:/images/mobile in favour of docroot:/static/images/mobile - https://phabricator.wikimedia.org/T107395#1502942 (10Dzahn) [14:13:11] (03CR) 10Dzahn: "since there was quite some discussion here about the naming and a meeting took place with bblack/paravoid and FR it should get a +1 from o" [dns] - 10https://gerrit.wikimedia.org/r/227705 (https://phabricator.wikimedia.org/T107060) (owner: 10John F. Lewis) [14:13:25] 6operations: staged dumps tracking task - https://phabricator.wikimedia.org/T107757#1502963 (10ArielGlenn) 3NEW a:3ArielGlenn [14:15:55] 6operations: allow dumps to be treated as 'done' even though some steps are skipped - https://phabricator.wikimedia.org/T107758#1502972 (10ArielGlenn) 3NEW a:3ArielGlenn [14:15:58] (03CR) 10Dzahn: "i really don't know about the status of nutcracker, maybe ori could comment on this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/223844 (https://phabricator.wikimedia.org/T104970) (owner: 10Dzahn) [14:17:05] (03CR) 10Aude: [C: 032] Enable usage tracking on svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228808 (owner: 10Aude) [14:17:09] 6operations: staged dumps tracking task - https://phabricator.wikimedia.org/T107757#1502985 (10ArielGlenn) [14:17:10] 6operations: allow dumps to be treated as 'done' even though some steps are skipped - https://phabricator.wikimedia.org/T107758#1502986 (10ArielGlenn) [14:17:11] (03Merged) 10jenkins-bot: Enable usage tracking on svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228808 (owner: 10Aude) [14:18:19] !log aude Synchronized usagetracking.dblist: Enable usage tracking on svwiki (duration: 00m 12s) [14:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:18:45] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 2 connecting: cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6, cp4005_v4, cp4005_v6, cp4006_ [14:18:45] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 2 connecting: cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6, cp4005_v4, cp4005_v6, cp4006_ [14:18:55] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 2 connecting: cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, 
cp3038_v6, cp3039_v4, cp3039_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6, cp4005_v4, cp4005_v6, cp4006_ [14:19:06] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 2 connecting: cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6, cp4005_v4, cp4005_v6, cp4006_ [14:19:11] 6operations: worker bash script terminates early when there are still more wikis to run - https://phabricator.wikimedia.org/T107759#1502993 (10ArielGlenn) 3NEW [14:19:15] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 2 connecting: cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6, cp4005_v4, cp4005_v6, cp4006_ [14:19:25] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 2 connecting: cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6, cp4005_v4, cp4005_v6, cp4006_ [14:19:25] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 2 connecting: cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, 
cp3039_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6, cp4005_v4, cp4005_v6, cp4006_ [14:19:25] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 2 connecting: cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6, cp4005_v4, cp4005_v6, cp4006_ [14:19:25] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 2 connecting: cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6, cp4005_v4, cp4005_v6, cp4006_ [14:19:44] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 2 connecting: cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6, cp4005_v4, cp4005_v6, cp4006_ [14:19:44] PROBLEM - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 2 connecting: cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6, cp4005_v4, 
cp4005_v6, cp4006_ [14:19:48] 6operations: worker bash script terminates early when there are still more wikis to run - https://phabricator.wikimedia.org/T107759#1503000 (10ArielGlenn) a:3ArielGlenn [14:20:05] PROBLEM - IPsec on cp1051 is CRITICAL: Strongswan CRITICAL - ok: 2 connecting: cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6, cp4005_v4, cp4005_v6, cp4006_ [14:20:25] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 2 connecting: cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6, cp4005_v4, cp4005_v6, cp4006_ [14:20:25] 6operations: worker bash script terminates early when there are still more wikis to run - https://phabricator.wikimedia.org/T107759#1502993 (10ArielGlenn) [14:20:26] 6operations: staged dumps tracking task - https://phabricator.wikimedia.org/T107757#1503002 (10ArielGlenn) [14:20:44] 6operations: staged dumps tracking task - https://phabricator.wikimedia.org/T107757#1502963 (10ArielGlenn) [14:20:46] 6operations: Make dumps run via cron on each snapshot host - https://phabricator.wikimedia.org/T107750#1503004 (10ArielGlenn) [14:21:35] PROBLEM - Host analytics1043 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:36] 6operations: kernel BUG at /build/buildd/linux-3.13.0/mm/memory.c:3756 for analytics1044 and analytics1043 - https://phabricator.wikimedia.org/T107698#1503007 (10Ottomata) OO, nasty. 
Googled, found these: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1315736 http://androidspanner.blogspot.com/2014/12/ke... [14:21:39] 6operations: kernel BUG at /build/buildd/linux-3.13.0/mm/memory.c:3756 for analytics1044 and analytics1043 - https://phabricator.wikimedia.org/T107698#1503008 (10jcrespo) a:3Ottomata [14:21:48] oh that's me, an43, sorry [14:21:54] !log upgrading kernel on analytics1042-1049 from 3.13.0.24.28 to 3.13.0.61.68 because T107698 [14:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:23:45] RECOVERY - Host analytics1043 is UP: PING OK - Packet loss = 0%, RTA = 0.92 ms [14:23:47] !log updated iojs on apt.wikimedia.org to 2.5.0 for jessie-wikimedia [14:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:24:04] 6operations: need script that handles all bash worker scripts on a given snapshot, per stage, rerunning failures as appropriate, managing resources as appropriate - https://phabricator.wikimedia.org/T107760#1503010 (10ArielGlenn) 3NEW a:3ArielGlenn [14:24:27] 7Blocked-on-Operations, 6operations, 10Parsoid, 6Services: Offer io.js on Jessie - https://phabricator.wikimedia.org/T91855#1503018 (10MoritzMuehlenhoff) I have imported 2.5.0 for jessie-wikimedia (iojs and iojs-dbg) [14:24:30] 6operations: need script that handles all bash worker scripts on a given snapshot, per stage, rerunning failures as appropriate, managing resources as appropriate - https://phabricator.wikimedia.org/T107760#1503021 (10ArielGlenn) [14:24:32] 6operations: Make dumps run via cron on each snapshot host - https://phabricator.wikimedia.org/T107750#1503020 (10ArielGlenn) [14:26:33] 6operations: need script that handles all bash worker scripts on a given snapshot, per stage, rerunning failures as appropriate, managing resources as appropriate - https://phabricator.wikimedia.org/T107760#1503030 (10ArielGlenn) https://gerrit.wikimedia.org/r/228809 untested are 'rerun' and email
notification. [14:26:45] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1503031 (10BBlack) I'm generally agreeable to the plan, but I'm still a bit worried about the timeline here. We don't have firm deadlines, but presenting a plan on Oct 7 doesn't mean being ready... [14:27:07] 6operations, 3Labs-Sprint-104, 3Labs-Sprint-105, 3Labs-Sprint-107: Setup/Install/Deploy labnet1002 - https://phabricator.wikimedia.org/T99701#1503033 (10Andrew) Now, how do we switch to actually using this box? One option is https://phabricator.wikimedia.org/T107731 Another option is to try to do some... [14:27:17] !log upgrading junos on cr1-codfw [14:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:27:29] (03CR) 10Ottomata: [C: 031] Add ferm rules for Hue server [puppet] - 10https://gerrit.wikimedia.org/r/228788 (https://phabricator.wikimedia.org/T83597) (owner: 10Muehlenhoff) [14:28:16] ACKNOWLEDGEMENT - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 2 connecting: cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6, cp4005_v4, cp4005_v6, [14:28:16]–[14:28:20] ACKNOWLEDGEMENT - IPsec on cp1051, cp1049, cp1061, cp1062, cp1073, cp1099, cp1050, cp1074, cp1072, cp1071, cp1063, cp1064 is CRITICAL: Strongswan CRITICAL - ok: 2 connecting: (identical truncated tunnel list as for cp1048 above) [14:28:49] (03PS2) 10Dzahn: Add a separate Hiera source for dumps mirrors only reachable via IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/228217 (https://phabricator.wikimedia.org/T104991) (owner: 10Muehlenhoff) [14:28:54] sorry for the spam, I got lost in phabricator when I was supposed to be watching to catch those!
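[Editor's note] The acknowledged IPsec alerts above all come from the same Strongswan Icinga check. Its gist, sketched here against a fabricated status sample (the real check parses `ipsec status` output on the cache host), is counting established versus still-connecting ESP security associations:

```shell
# Minimal sketch of the Strongswan check's logic, run against a fake sample.
# Tunnel names mirror the log; states and counts are fabricated for illustration.
sample='cp3032_v4: ESTABLISHED 2 hours ago
cp3032_v6: ESTABLISHED 2 hours ago
cp3033_v4: CONNECTING'
ok=$(printf '%s\n' "$sample" | grep -c 'ESTABLISHED')
connecting=$(printf '%s\n' "$sample" | grep -c 'CONNECTING')
if [ "$connecting" -gt 0 ]; then
  # Matches the alert shape above: "Strongswan CRITICAL - ok: 2 connecting: ..."
  echo "Strongswan CRITICAL - ok: $ok connecting: $connecting"
else
  # Matches the recovery shape later in the log: "Strongswan OK - 44 ESP OK"
  echo "Strongswan OK - $ok ESP OK"
fi
```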
[14:29:14] RECOVERY - Disk space on analytics1015 is OK: DISK OK [14:29:14] RECOVERY - Disk space on Hadoop worker on analytics1015 is OK: DISK OK [14:29:14] RECOVERY - salt-minion processes on analytics1015 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:29:46] (03CR) 10Dzahn: [C: 032] Add a separate Hiera source for dumps mirrors only reachable via IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/228217 (https://phabricator.wikimedia.org/T104991) (owner: 10Muehlenhoff) [14:29:55] RECOVERY - RAID on analytics1015 is OK no disks configured for RAID [14:30:06] RECOVERY - SSH on analytics1015 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [14:30:44] RECOVERY - dhclient process on analytics1015 is OK: PROCS OK: 0 processes with command name dhclient [14:30:45] RECOVERY - configured eth on analytics1015 is OK - interfaces up [14:30:45] RECOVERY - DPKG on analytics1015 is OK: All packages OK [14:30:56] (03CR) 10Ottomata: [C: 031] Add ferm rules for Hive server/metastore [puppet] - 10https://gerrit.wikimedia.org/r/228791 (https://phabricator.wikimedia.org/T83597) (owner: 10Muehlenhoff) [14:31:16] (03CR) 10Ottomata: [C: 031] Add ferm rules for Oozie HTTP interface [puppet] - 10https://gerrit.wikimedia.org/r/228792 (https://phabricator.wikimedia.org/T83597) (owner: 10Muehlenhoff) [14:31:44] (03CR) 10Ottomata: [C: 031] add ferm rules for udp2log [puppet] - 10https://gerrit.wikimedia.org/r/227719 (owner: 10Muehlenhoff) [14:31:48] 7Puppet, 6operations, 7Database: Puppet failure on all hosts with Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Mysql::Error: Out of range value for column 'id' at row 1: INSERT INTO `fact_values` (`updated_at`, `host_id`, `creat... 
- https://phabricator.wikimedia.org/T107753#1503064 [14:32:13] 6operations, 10Traffic, 5Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#1503065 (10BBlack) 5Open>3Resolved [14:33:16] !log temp. stop puppet on dataset1001 [14:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:34:37] (03PS1) 10Muehlenhoff: Fix the dumps list for ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/228818 [14:36:08] (03CR) 10ArielGlenn: [C: 031] Fix the dumps list for ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/228818 (owner: 10Muehlenhoff) [14:36:52] (03CR) 10Dzahn: Add a separate Hiera source for dumps mirrors only reachable via IPv6 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/228217 (https://phabricator.wikimedia.org/T104991) (owner: 10Muehlenhoff) [14:37:33] (03CR) 10Dzahn: [C: 032] Fix the dumps list for ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/228818 (owner: 10Muehlenhoff) [14:40:04] (03PS1) 10Aude: Enable usage tracking on thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228820 [14:40:42] (03CR) 10Aude: [C: 032] Enable usage tracking on thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228820 (owner: 10Aude) [14:40:48] (03Merged) 10jenkins-bot: Enable usage tracking on thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228820 (owner: 10Aude) [14:41:03] 6operations, 3Labs-Sprint-104, 3Labs-Sprint-105, 3Labs-Sprint-107: Setup/Install/Deploy labnet1002 - https://phabricator.wikimedia.org/T99701#1503086 (10Andrew) > I've no idea how to make that happen though. Well, that's not exactly true. It is /probably/ as simple as changing templates/wmnet:eth4-1102.... [14:41:15] (03PS4) 10BBlack: add benefactorevents & eventsdonations CNAMEs for Major Gift [dns] - 10https://gerrit.wikimedia.org/r/227705 (https://phabricator.wikimedia.org/T107060) (owner: 10John F. 
Lewis) [14:42:00] (03CR) 10BBlack: [C: 032] "No point holding this up at this point, regardless of deeper disagreements about it." [dns] - 10https://gerrit.wikimedia.org/r/227705 (https://phabricator.wikimedia.org/T107060) (owner: 10John F. Lewis) [14:42:01] !log aude Synchronized usagetracking.dblist: Enable usage tracking on thwiki (duration: 00m 12s) [14:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:44:37] (03PS2) 10Dzahn: Add ferm rules for Oozie HTTP interface [puppet] - 10https://gerrit.wikimedia.org/r/228792 (https://phabricator.wikimedia.org/T83597) (owner: 10Muehlenhoff) [14:47:14] (03CR) 10Dzahn: [C: 032] Add ferm rules for Oozie HTTP interface [puppet] - 10https://gerrit.wikimedia.org/r/228792 (https://phabricator.wikimedia.org/T83597) (owner: 10Muehlenhoff) [14:47:16] 6operations, 10Traffic, 10Wikimedia-DNS, 5Patch-For-Review: DNS request for wikimedia.org - https://phabricator.wikimedia.org/T107060#1503102 (10BBlack) 5Open>3Resolved a:3BBlack [14:47:33] (03CR) 10Alexandros Kosiaris: "Happy that this worked out fine. Clean merge was not expected anyway on top of the 0.8.2.1 tag. It rather required a bit of an effort to c" [debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/227988 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [14:49:08] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1503105 (10BBlack) Getting back onto the technical tasks at hand: the DNS part is merged now, and AIUI we (ops) need to purchase a cert from globalsign (RSA, no wildcard/SAN, single hostname for "... [14:49:11] akosiaris: u no like debian branch name? 
[14:49:34] it seemed like what every other gbp repo we had was doing, and was less confusing for me [14:49:46] i know gbp defaults to master, but usually we don't use master because most upstreams already have a master branch [14:49:58] just for whatever (svn) historical reasons kafka upstream has trunk instead [14:50:33] 6operations, 3Labs-Sprint-104, 3Labs-Sprint-105, 3Labs-Sprint-107: Setup/Install/Deploy labnet1002 - https://phabricator.wikimedia.org/T99701#1503117 (10Andrew) whoah, except why is that eth4-1102 when the actual interface is eth1.1102@eth1? [14:52:15] (03CR) 10Dzahn: [C: 031] add ferm rules for udp2log [puppet] - 10https://gerrit.wikimedia.org/r/227719 (owner: 10Muehlenhoff) [14:54:13] !log krenair Synchronized php-1.26wmf16/extensions/SemanticResultFormats: https://gerrit.wikimedia.org/r/#/c/228793/ (duration: 00m 13s) [14:54:14] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1503123 (10BBlack) @EWilfong_WMF - do you have a way for us to hand off the private key to you securely? (e.g. a PGP public key) [14:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:54:23] (03PS2) 10Dzahn: Add ferm rules for Hue server [puppet] - 10https://gerrit.wikimedia.org/r/228788 (https://phabricator.wikimedia.org/T83597) (owner: 10Muehlenhoff) [14:54:37] (tested, working) [14:54:45] ((only affects a very obscure syntax on wikitech anyway)) [14:55:16] (03CR) 10Dzahn: [C: 032] "@analytics1027:/etc/hue# grep -r 8888 *" [puppet] - 10https://gerrit.wikimedia.org/r/228788 (https://phabricator.wikimedia.org/T83597) (owner: 10Muehlenhoff) [14:55:27] ottomata: well, my point is that debian will be problematic again in 0.8.X.Y [14:55:37] not with "debian" itself [14:56:11] as in, we will have to checkout the tag again and reapply on top of it, no ? 
[14:56:23] ah [14:56:26] i see [14:56:54] right, but i'm not sure why i had so much trouble, I would think that it would be easy to merge or rebase if you were just applying future commits, no? [14:57:24] I am a bit perplexed by that too. I did expect some merge problems but not what you described [14:57:31] (03PS1) 10Aude: Enable usage tracking on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228824 [14:57:33] and supposedly --theirs should have helped a lot [14:57:53] Krenair: can i deploy one more thing before swat [14:57:56] config change [14:58:15] aude, I've just logged off, sounds fine to me but I'm not greg-g :) [14:58:18] ok [14:58:20] and I'm not doing swat today [14:58:27] actually supposed to be on holiday [14:58:28] it's my window now, for 2 more minutes :) [14:58:31] heh [14:58:38] ottomata: merge, not rebase though. No rebasing in public repos ;-) [14:59:11] (03CR) 10Aude: [C: 032] Enable usage tracking on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228824 (owner: 10Aude) [14:59:15] (03Merged) 10jenkins-bot: Enable usage tracking on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228824 (owner: 10Aude) [14:59:50] !log aude Synchronized usagetracking.dblist: Enable usage tracking on trwiki (duration: 00m 12s) [14:59:54] ok, done for now [14:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150803T1500). Please do the needful. [15:00:04] gilles Krenair: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [15:00:19] 14:41:45 connection spike on s3 [15:00:34] *14:42:45 [15:00:34] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1503135 (10EWilfong_WMF) Sure. 
Here's my public PGP key: ``` -----BEGIN PGP PUBLIC KEY BLOCK----- Version: Encryption Desktop 10.3.2 (Build 15917) - not licensed for commercial use: www.pgp.com... [15:00:55] I can SWAT this morning: gilles ping for SWAT; Krenair looks like your is done? [15:01:04] thcipriani: pong [15:01:48] (03PS1) 10Ottomata: Preparing to reinstall and expand Kafka cluster on Jessie at Kafka 0.8.2.1 [puppet] - 10https://gerrit.wikimedia.org/r/228826 (https://phabricator.wikimedia.org/T106581) [15:02:04] (03PS2) 10Ottomata: Preparing to reinstall and expand Kafka cluster on Jessie at Kafka 0.8.2.1 [puppet] - 10https://gerrit.wikimedia.org/r/228826 (https://phabricator.wikimedia.org/T106581) [15:03:33] (03CR) 10Ottomata: [C: 032 V: 032] Debianize 0.8.2.1 tag [debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/227989 (owner: 10Ottomata) [15:04:28] (03CR) 10Muehlenhoff: "The ipv6-only mirrors are properly handled now, could you please re-review?" [puppet] - 10https://gerrit.wikimedia.org/r/227712 (https://phabricator.wikimedia.org/T104991) (owner: 10Muehlenhoff) [15:04:40] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [15:05:38] (03CR) 10Dzahn: [C: 031] "works on ms1001" [puppet] - 10https://gerrit.wikimedia.org/r/227712 (https://phabricator.wikimedia.org/T104991) (owner: 10Muehlenhoff) [15:05:44] (03PS3) 10Dzahn: Enable base::firewall on dataset1001 [puppet] - 10https://gerrit.wikimedia.org/r/227712 (https://phabricator.wikimedia.org/T104991) (owner: 10Muehlenhoff) [15:06:08] gilles: it looks like for wmf16 submodule updates are autotracked by the branch so https://gerrit.wikimedia.org/r/#/c/228219 is superfluous, I'll sync out the bump here in a second. [15:06:22] 7Blocked-on-Operations, 6operations, 10Parsoid, 6Services: Offer io.js on Jessie - https://phabricator.wikimedia.org/T91855#1503143 (10GWicke) 5Open>3Resolved a:3GWicke @moritzmuehlenhoff: Thanks again, this is great! 
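[Editor's note] The tag-merge workflow akosiaris and ottomata debate above (re-applying packaging on top of a new upstream tag, and whether `--theirs` helps) can be reproduced in miniature. Everything below runs in a throwaway repo; the version strings merely stand in for the Kafka 0.8.x tags, and `-X theirs` is the strategy option mentioned at 14:57.

```shell
# Toy gbp-style layout: upstream history on the default branch, packaging on
# a "debian" branch; merge a new upstream commit with upstream winning conflicts.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email packager@example.org   # throwaway identity for the demo
git config user.name packager
main=$(git symbolic-ref --short HEAD)        # default branch name varies by git version
echo 'version=0.8.1' > VERSION
git add VERSION && git commit -qm 'upstream 0.8.1'
git checkout -q -b debian
echo 'version=0.8.1-wmf1' > VERSION          # packaging tweak that will conflict
git add VERSION && git commit -qm 'packaging'
git checkout -q "$main"
echo 'version=0.8.2.1' > VERSION             # new upstream release
git add VERSION && git commit -qm 'upstream 0.8.2.1'
git checkout -q debian
git merge -q -X theirs -m 'merge upstream 0.8.2.1' "$main"
cat VERSION
```

With `-X theirs`, the conflicting hunk resolves to the incoming upstream side, so `VERSION` ends up at `0.8.2.1` while the debian branch keeps its packaging history, which is roughly the clean re-merge akosiaris expected.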
[15:06:50] thcipriani: does that mean submodule bumps aren't needed anymore for that extension? [15:07:16] (03CR) 10Dzahn: [C: 032] "also a good time to do this per apergos, only 1 job running now" [puppet] - 10https://gerrit.wikimedia.org/r/227712 (https://phabricator.wikimedia.org/T104991) (owner: 10Muehlenhoff) [15:07:54] gilles: that's what happens when you +2 in extension deployment branch [15:07:58] now* [15:08:08] ah, so it is a new thing [15:08:10] gilles: it depends on how the branch is cut and whether .gitmodules tracks branches what gerrit will do. in this instance https://gerrit.wikimedia.org/r/#/c/228218/ was all that was needed [15:08:11] TIL, thanks [15:08:14] (it also gets automatically merged, which is not nice imho) [15:08:31] so your changes are on tin (since friday) but not deployed yet [15:08:55] is that how things are going to be done going forward or some branch cuts will be like this and others not? [15:09:13] i think it will be like this, except maybe not the automatic merge of submodule bumps [15:09:36] (03PS2) 10Muehlenhoff: add ferm rules for udp2log [puppet] - 10https://gerrit.wikimedia.org/r/227719 [15:10:43] !log thcipriani Synchronized php-1.26wmf16/extensions/MultimediaViewer: SWAT: Track image load time with statsv [[gerrit:228218]] (duration: 00m 12s) [15:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:10:55] ^ gilles should be sync'd now, please check [15:11:21] PROBLEM - puppet last run on labcontrol2001 is CRITICAL Puppet has 1 failures [15:11:58] (03CR) 10Muehlenhoff: [C: 032 V: 032] add ferm rules for udp2log [puppet] - 10https://gerrit.wikimedia.org/r/227719 (owner: 10Muehlenhoff) [15:13:57] thcipriani: I don't see the new code yet [15:14:53] i wouldn't have done git submodule update ... [15:15:09] gilles: with ?debug=true ? [15:15:16] thcipriani: yep [15:15:28] thcipriani: did you do git submodule update ... ? 
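[Editor's note] The branch-tracking behaviour thcipriani describes hinges on a `branch` line in `.gitmodules`; when it is set, the submodule can follow the deployment branch rather than a pinned commit. An illustrative entry (the repo URL and branch name here are assumptions, not copied from the actual config) might look like:

```ini
[submodule "extensions/MultimediaViewer"]
	path = extensions/MultimediaViewer
	url = https://gerrit.wikimedia.org/r/mediawiki/extensions/MultimediaViewer
	branch = wmf/1.26wmf16
```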
[15:15:32] https://www.dropbox.com/s/pfcob79rnzen1jy/Screenshot%202015-08-03%2017.15.28.png?dl=0 [15:16:15] 7Blocked-on-Operations, 6operations, 10Parsoid, 6Services: Offer io.js on Jessie - https://phabricator.wikimedia.org/T91855#1097544 (10GWicke) [15:16:27] aude: yes: git submodule update --rebase extensions/MultimediaViewer inside php1.26wmf16 [15:16:36] ok [15:17:19] 6operations, 7network: investigate ethernet errors: asw2-a5-eqiad port xe-0/0/36 - https://phabricator.wikimedia.org/T107635#1503171 (10BBlack) I've disabled the link (1/4 from aggregate) on both sides: ``` {master:0}[edit] bblack@asw2-a5-eqiad# show|compare [edit interfaces xe-0/0/36] + disable; {master:0... [15:17:59] gilles: might have something to do with this: https://wikitech.wikimedia.org/wiki/How_to_deploy_code#A_note_on_JavaScript_and_CSS checking to see if the code made it to mw1017 now [15:18:46] except here ?debug=1 doesn't serve the new content [15:19:36] (03PS3) 10Ottomata: Preparing to reinstall and expand Kafka cluster on Jessie at Kafka 0.8.2.1 [puppet] - 10https://gerrit.wikimedia.org/r/228826 (https://phabricator.wikimedia.org/T106581) [15:20:01] so the new js is on mw1017 which means it likely made it out everywhere [15:20:10] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [15:20:13] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add ferm rules for Logstash/Elasticsearch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/227960 (https://phabricator.wikimedia.org/T104964) (owner: 10Muehlenhoff) [15:20:15] maybe touch the files? [15:20:39] (03CR) 10Ottomata: [C: 032] Preparing to reinstall and expand Kafka cluster on Jessie at Kafka 0.8.2.1 [puppet] - 10https://gerrit.wikimedia.org/r/228826 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [15:21:05] aude: touch and resync? 
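[Editor's note] What aude's "touch and resync?" suggestion amounts to: bump file mtimes so the deployment sync and ResourceLoader treat the content as changed. The directory layout below is a throwaway stand-in for the staging tree, not the real deploy-host path.

```shell
# Simulate a stale synced file, then the "touch" step before re-syncing.
stage=$(mktemp -d)
mkdir -p "$stage/extensions/MultimediaViewer/resources"
js="$stage/extensions/MultimediaViewer/resources/mmv.logging.PerformanceLogger.js"
echo '// module code' > "$js"
touch -d '2015-08-03 15:00' "$js"                # pretend it was synced earlier
before=$(date -r "$js" +%s)                      # mtime before the bump
find "$stage/extensions/MultimediaViewer" -type f -exec touch {} +   # the "touch" step
after=$(date -r "$js" +%s)                       # mtime after the bump
if [ "$after" -gt "$before" ]; then echo "mtime bumped; ready to re-sync"; fi
```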
[15:21:39] 1yup [15:21:41] -1 [15:22:28] !log reinstalling analytics1013,1014 and 1020 with Jessie [15:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:23:30] !log thcipriani Synchronized php-1.26wmf16/extensions/MultimediaViewer: SWAT: Track image load time with statsv (touch and re-sync) [[gerrit:228218]] (duration: 00m 12s) [15:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:23:36] ^ touched and re-synced [15:23:51] (03PS1) 10Ottomata: Removing analytics1013,1014 and 1018 from hadoop worker list in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/228832 (https://phabricator.wikimedia.org/T106581) [15:24:13] (03CR) 10Ottomata: [C: 032 V: 032] Removing analytics1013,1014 and 1018 from hadoop worker list in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/228832 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [15:24:34] still not seeing the new code :( [15:25:03] (03PS1) 10Dzahn: Revert "Enable base::firewall on dataset1001" [puppet] - 10https://gerrit.wikimedia.org/r/228833 [15:25:17] to have a look yourself, open https://en.wikipedia.org/wiki/Barack_Obama?debug=1#/media/File:President_Barack_Obama.jpg and in the console type mw.mmv.logging.PerformanceLogger.prototype.log [15:25:27] to get the link straight to the function definition [15:26:04] (03PS2) 10Dzahn: Revert "Enable base::firewall on dataset1001" [puppet] - 10https://gerrit.wikimedia.org/r/228833 [15:26:34] i think i see the code [15:26:39] * aude looks at the patch [15:26:44] (03CR) 10Dzahn: [C: 032] Revert "Enable base::firewall on dataset1001" [puppet] - 10https://gerrit.wikimedia.org/r/228833 (owner: 10Dzahn) [15:27:10] I still see https://www.dropbox.com/s/y0dfzrueb0flvi4/Screenshot%202015-08-03%2017.26.59.png?dl=0 [15:27:26] nope, don't see [15:27:44] so the new code shows when you open the file directly: 
https://en.wikipedia.org/static/1.26wmf16/extensions/MultimediaViewer/resources/mmv/logging/mmv.logging.PerformanceLogger.js (seeing // Track thumbnail load time with statsv, unsampled) [15:28:33] thcipriani: not for me [15:29:35] 6operations, 5Interdatacenter-IPsec, 5Patch-For-Review: IPSec: roll-out plan - https://phabricator.wikimedia.org/T92604#1503209 (10BBlack) cp3034 (upload esams) has had all its backhaul to eqiad over ipsec now for ~1h. CPU impact is, again, virtually non-existent. Attempting wipe-test now. [15:29:51] ah, might be because I'm sending the X-Wikimedia-Debug: 1 header :\ [15:30:21] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 16.67% of data above the critical threshold [500.0] [15:30:40] thcipriani: yep, with the header the new code is there [15:33:18] thcipriani: I've just graphed the data, it looks like it's working (bottom graph) https://grafana.wikimedia.org/#/dashboard/db/media [15:34:24] !log wiping cp3034 disk cache (upload esams) for ipsec reload testing [15:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:35:52] gilles: wow, yeah, that's good. Still strange that the unminified version is cached (has expire headers set for Sept) seemingly ResourceLoader is smart enough to work through this sort of thing, though. 
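[Editor's note] The cache-bypass check the SWAT thread converges on: request the static asset with the `X-Wikimedia-Debug` header set, as gilles was doing in his browser. The command is printed rather than executed here, since it needs network access; the URL is the one pasted in the log.

```shell
# Build and show the equivalent curl invocation for the debug-header check.
url='https://en.wikipedia.org/static/1.26wmf16/extensions/MultimediaViewer/resources/mmv/logging/mmv.logging.PerformanceLogger.js'
cmd="curl -s -H 'X-Wikimedia-Debug: 1' '$url'"
echo "$cmd"
```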
[15:36:40] RECOVERY - puppet last run on labcontrol2001 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures [15:39:49] 6operations, 10RESTBase: Update JDK 8 package in backports repo - https://phabricator.wikimedia.org/T104887#1503232 (10GWicke) Since switching back to JDK8, GC timings and latencies have been back to normal: {F342426} {F342439} [15:40:00] (03PS1) 10Muehlenhoff: Add missing ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/228837 (https://bugzilla.wikimedia.org/105040) [15:40:31] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [15:41:02] 6operations, 5Interdatacenter-IPsec, 5Patch-For-Review: IPSec: roll-out plan - https://phabricator.wikimedia.org/T92604#1503240 (10BBlack) Wipe-test success. The cpu bump in iowait% from rewriting the cache (which is also minor) is easier to see than any from the related extra crypto. [15:42:41] (03PS1) 10BBlack: enable ipsec for all upload caches [puppet] - 10https://gerrit.wikimedia.org/r/228838 (https://phabricator.wikimedia.org/T92604) [15:44:01] 6operations, 10RESTBase-Cassandra, 6Services, 5Patch-For-Review, 7RESTBase-architecture: put new restbase servers in service - https://phabricator.wikimedia.org/T102015#1503250 (10GWicke) @fgiunchedi, @eevans: Should we start the bootstrap for 1009 today? 
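[Editor's note] The flapping graphite1001 "HTTP 5xx req/min" alerts above report what fraction of recent datapoints sit over the 500 req/min critical threshold. A rough model of that logic, against a fabricated sample series:

```shell
# Count datapoints over the threshold and report the percentage, in the same
# shape as the Icinga message. The series values are made up for illustration.
series='120 180 650 200 90 900 210 300 140 110 170 130 150'
total=0; over=0
for v in $series; do
  total=$((total + 1))
  if [ "$v" -gt 500 ]; then over=$((over + 1)); fi
done
pct=$(awk -v o="$over" -v t="$total" 'BEGIN { printf "%.2f", 100 * o / t }')
echo "${pct}% of data above the critical threshold [500.0]"
```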
[15:44:08] (03CR) 10BBlack: [C: 032] enable ipsec for all upload caches [puppet] - 10https://gerrit.wikimedia.org/r/228838 (https://phabricator.wikimedia.org/T92604) (owner: 10BBlack) [15:45:38] there will be some unavoidable channelspam in the future of IPSec icinga recoveries (of acked criticals, from completing the deployment) [15:45:48] the near future, I should have said [15:47:53] 6operations: move some wikis from small to big dumps config - https://phabricator.wikimedia.org/T107767#1503269 (10ArielGlenn) 3NEW a:3ArielGlenn [15:48:02] 6operations: move some wikis from small to big dumps config - https://phabricator.wikimedia.org/T107767#1503278 (10ArielGlenn) [15:48:04] 6operations: Make dumps run via cron on each snapshot host - https://phabricator.wikimedia.org/T107750#1503277 (10ArielGlenn) [15:53:11] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 44 ESP OK [15:53:11] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 44 ESP OK [15:53:21] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 44 ESP OK [15:53:38] 6operations, 10ops-eqiad, 7network: investigate ethernet errors: asw2-a5-eqiad port xe-0/0/36 - https://phabricator.wikimedia.org/T107635#1503281 (10faidon) [15:53:41] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 44 ESP OK [15:53:52] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 44 ESP OK [15:53:52] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 44 ESP OK [15:54:06] (03PS1) 10Milimetric: Enable the mobile data crunching job [puppet] - 10https://gerrit.wikimedia.org/r/228840 (https://phabricator.wikimedia.org/T104379) [15:54:11] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 44 ESP OK [15:54:11] RECOVERY - IPsec on cp1051 is OK: Strongswan OK - 44 ESP OK [15:54:12] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 44 ESP OK [15:54:12] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 44 ESP OK [15:54:12] RECOVERY - IPsec on cp1061 is OK: Strongswan OK - 44 ESP OK [15:54:21] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 44 ESP OK 
[15:54:41] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 44 ESP OK [15:55:13] (03CR) 10Ottomata: [C: 032] Enable the mobile data crunching job [puppet] - 10https://gerrit.wikimedia.org/r/228840 (https://phabricator.wikimedia.org/T104379) (owner: 10Milimetric) [15:57:00] 6operations, 5Interdatacenter-IPsec, 5Patch-For-Review: IPSec: roll-out plan - https://phabricator.wikimedia.org/T92604#1503288 (10BBlack) 5Open>3Resolved ipsec is active for all of the primary clusters for cache<->cache from tier2 to tier1: text, upload, mobile, bits. bits doesn't technically protect a... [15:59:01] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 23.08% of data above the critical threshold [500.0] [16:00:05] aude: Respected human, time to deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150803T1600). Please do the needful. [16:07:02] 6operations, 10Analytics-Cluster, 5Interdatacenter-IPsec: Secure inter-datacenter web request log (Kafka) traffic - https://phabricator.wikimedia.org/T92602#1503299 (10BBlack) [16:07:23] 6operations, 3Labs-Sprint-104, 3Labs-Sprint-105, 3Labs-Sprint-107: Setup/Install/Deploy labnet1002 - https://phabricator.wikimedia.org/T99701#1503300 (10Andrew) More info from code-diving: When associating a new instance with a network in flatdhcp mode, it just grabs the list of networks and associates th... [16:09:50] 6operations, 10Wikimedia-Mailing-lists: recent e-mails missing from pywikibot archive - https://phabricator.wikimedia.org/T107769#1503323 (10JohnLewis) p:5Triage>3High [16:12:11] 6operations, 10Wikimedia-Mailing-lists: recent e-mails missing from pywikibot archive - https://phabricator.wikimedia.org/T107769#1503330 (10JohnLewis) @robh: Can you check to see if the logs have anything interesting in them? (there should be a general or archive-related log.) I feel this might be an issue se... [16:13:12] I keep getting intermittent (wrong word?) 503s on Commons... 
[16:13:17] Request: GET http://commons.wikimedia.org/w/index.php?title=File:Clouds.JPG&curid=401445&action=history, from 10.20.0.106 via cp1055 cp1055 ([10.64.32.107]:3128), Varnish XID 711934244 - Forwarded for: 81.236.232.15, 10.20.0.176, 10.20.0.176, 10.20.0.106 - Error: 503, Service Unavailable at Mon, 03 Aug 2015 16:12:16 GMT
[16:13:28] (nice new splash/crash-screen though :)
[16:15:17] Anyone know the reasons for these 503s?
[16:15:22] not yet
[16:16:16] ok :/
[16:17:57] so far I can't reproduce that, using my own logged-in session
[16:18:09] how many times did it happen, or how often?
[16:22:40] (03CR) 10Yuvipanda: nrpe: Merge check_systemd_unit_lastrun into _state (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/228329 (owner: 10coren)
[16:25:20] What was up with the dataset hosts earlier? NFS failure?
[16:25:41] I think I might need to re-create the Wikidata dump, could be incomplete
[16:27:25] !log upgrading junos on cr2-codfw
[16:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:28:14] (03PS1) 10Ottomata: Provisioning analytics1013 as Kafka broker in analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/228847 (https://phabricator.wikimedia.org/T106581)
[16:28:37] hoo: incorrect firewall rules, sorry for the inconvenience
[16:29:09] bblack: like 5 times this hour
[16:30:11] (03CR) 10Alexandros Kosiaris: [C: 032] new_wmf_service: preserve order in yaml output [puppet] - 10https://gerrit.wikimedia.org/r/227963 (owner: 10Giuseppe Lavagetto)
[16:30:20] (03PS2) 10Alexandros Kosiaris: new_wmf_service: preserve order in yaml output [puppet] - 10https://gerrit.wikimedia.org/r/227963 (owner: 10Giuseppe Lavagetto)
[16:30:27] (03CR) 10Alexandros Kosiaris: [V: 032] new_wmf_service: preserve order in yaml output [puppet] - 10https://gerrit.wikimedia.org/r/227963 (owner: 10Giuseppe Lavagetto)
[16:31:31] (03PS2) 10Ottomata: Provisioning analytics1013 as Kafka broker in analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/228847 (https://phabricator.wikimedia.org/T106581)
[16:31:41] (03CR) 10Alexandros Kosiaris: [C: 032] new_wmf_service: use a slightly less ugly anchor/alias template [puppet] - 10https://gerrit.wikimedia.org/r/227964 (owner: 10Giuseppe Lavagetto)
[16:31:48] (03PS2) 10Alexandros Kosiaris: new_wmf_service: use a slightly less ugly anchor/alias template [puppet] - 10https://gerrit.wikimedia.org/r/227964 (owner: 10Giuseppe Lavagetto)
[16:31:54] (03CR) 10Alexandros Kosiaris: [V: 032] new_wmf_service: use a slightly less ugly anchor/alias template [puppet] - 10https://gerrit.wikimedia.org/r/227964 (owner: 10Giuseppe Lavagetto)
[16:33:09] (03PS3) 10Ottomata: Provisioning analytics1013 as Kafka broker in analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/228847 (https://phabricator.wikimedia.org/T106581)
[16:33:24] (03CR) 10Ottomata: [C: 032] Provisioning analytics1013 as Kafka broker in analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/228847 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata)
[16:33:34] (03CR) 10Ottomata: [V: 032] Provisioning analytics1013 as Kafka broker in analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/228847 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata)
[16:33:50] akosiaris: merge ok?
[16:34:17] (03CR) 10Alexandros Kosiaris: [C: 032] new_wmf_service: add conftool service config as well [puppet] - 10https://gerrit.wikimedia.org/r/227965 (owner: 10Giuseppe Lavagetto)
[16:34:29] ottomata: yeah sorry
[16:34:39] I was meaning to merge all 3 together
[16:34:40] np, merged.
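The Varnish 503 error text quoted above at [16:13:17] has a fixed shape (method and URL, client/cache hops, Varnish transaction ID, X-Forwarded-For chain). A hypothetical Python sketch for pulling the useful fields out of such a line when triaging — the regex and field names are this sketch's own, not an existing tool:

```python
import re

# The error text as reported above, reassembled into one string.
error = ("Request: GET http://commons.wikimedia.org/w/index.php?title=File:Clouds.JPG"
         "&curid=401445&action=history, from 10.20.0.106 via cp1055 cp1055 "
         "([10.64.32.107]:3128), Varnish XID 711934244 - Forwarded for: "
         "81.236.232.15, 10.20.0.176, 10.20.0.176, 10.20.0.106 - "
         "Error: 503, Service Unavailable at Mon, 03 Aug 2015 16:12:16 GMT")

# Capture the cache host, the Varnish transaction ID, and the forwarded-for chain.
m = re.search(r"via (\S+) .*Varnish XID (\d+).*Forwarded for: ([\d., ]+)", error)
cache_host = m.group(1)
xid = m.group(2)
forwarded = m.group(3).strip().split(", ")

# The first forwarded-for entry is the original client IP.
print(cache_host, xid, forwarded[0])  # cp1055 711934244 81.236.232.15
```

The XID is the key datum: it lets whoever is on the Varnish side correlate the user-visible error with the backend transaction in varnishlog.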
[16:34:41] PROBLEM - Host cr2-codfw is DOWN: CRITICAL - Network Unreachable (208.80.153.193)
[16:34:50] (03PS2) 10Alexandros Kosiaris: new_wmf_service: add conftool service config as well [puppet] - 10https://gerrit.wikimedia.org/r/227965 (owner: 10Giuseppe Lavagetto)
[16:34:54] (03CR) 10Alexandros Kosiaris: [V: 032] new_wmf_service: add conftool service config as well [puppet] - 10https://gerrit.wikimedia.org/r/227965 (owner: 10Giuseppe Lavagetto)
[16:36:20] PROBLEM - Router interfaces on cr1-codfw is CRITICAL host 208.80.153.192, interfaces up: 110, down: 3, dormant: 0, excluded: 0, unused: 0BRxe-5/2/0: down - cr2-codfw:xe-5/2/0 {#10695} [10Gbps DF]BRxe-5/3/0: down - cr2-codfw:xe-5/3/0 {#10696} [10Gbps DF]BRae0: down - Core: cr2-codfw:ae0BR
[16:36:31] PROBLEM - salt-minion processes on analytics1014 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[16:36:39] (03PS2) 10Alexandros Kosiaris: Fix typo introduced in 4efae00 [puppet] - 10https://gerrit.wikimedia.org/r/225824
[16:36:46] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Fix typo introduced in 4efae00 [puppet] - 10https://gerrit.wikimedia.org/r/225824 (owner: 10Alexandros Kosiaris)
[16:36:48] 6operations, 10RESTBase-Cassandra, 6Services, 5Patch-For-Review, 7RESTBase-architecture: put new restbase servers in service - https://phabricator.wikimedia.org/T102015#1503428 (10Eevans) >>! In T102015#1503250, @GWicke wrote: > @fgiunchedi, @eevans: Should we start the bootstrap for 1009 today? That's...
[16:37:10] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[16:37:11] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL host 208.80.154.197, interfaces up: 214, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-307236) (#3658) [10Gbps wave]BR
[16:37:31] PROBLEM - salt-minion processes on analytics1013 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[16:38:39] uh, are things stable right now? I need to backport a patch so we get stacktraces in exception.log again
[16:42:12] (03CR) 10coren: nrpe: Merge check_systemd_unit_lastrun into _state (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/228329 (owner: 10coren)
[16:42:21] RECOVERY - Router interfaces on cr1-codfw is OK host 208.80.153.192, interfaces up: 116, down: 0, dormant: 0, excluded: 0, unused: 0
[16:42:41] RECOVERY - Host cr2-codfw is UP: PING OK - Packet loss = 0%, RTA = 53.90 ms
[16:42:41] legoktm: I'm not aware of any instabilities atm
[16:42:51] (03PS1) 10Ottomata: Provision analytics1014 and analytics1020 as kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/228851 (https://phabricator.wikimedia.org/T106581)
[16:43:01] ok
[16:43:08] I was just concerned about icinga-wm complaining
[16:43:20] RECOVERY - Router interfaces on cr2-eqiad is OK host 208.80.154.197, interfaces up: 216, down: 0, dormant: 0, excluded: 0, unused: 0
[16:43:50] (03CR) 10Ottomata: [C: 032] Provision analytics1014 and analytics1020 as kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/228851 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata)
[16:49:16] !log Removed today's Wikidata json dump (wikidata-20150803-all.json.gz) because it was incomplete due to the dataset problems earlier
[16:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:52:45] 6operations, 10Traffic, 10fundraising-tech-ops, 5Patch-For-Review: Decide what to do with *.donate.wikimedia.org subdomain + TLS - https://phabricator.wikimedia.org/T102827#1503467 (10CCogdill_WMF) @BBlack I wanted to make sure you saw this as it relates to T107059. I don't have an ETA yet but expect this...
[16:54:34] !log legoktm Synchronized php-1.26wmf16/includes/debug/logger/: https://gerrit.wikimedia.org/r/#/c/228850/ (duration: 00m 11s)
[16:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:55:00] !log legoktm Synchronized php-1.26wmf16/autoload.php: https://gerrit.wikimedia.org/r/#/c/228850/ (duration: 00m 12s)
[16:55:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:56:16] (03PS3) 10Legoktm: logging: Enable stacktrace printing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228023 (https://phabricator.wikimedia.org/T107440) (owner: 10BryanDavis)
[16:56:48] bd808: do you have a way to trigger exceptions to make sure the logging changes don't break anything? :P
[16:57:15] legoktm: "use the wiki"?
[16:57:20] lolol
[16:58:04] I think if you tail the exception log you'll be able to tell pretty quickly
[16:58:41] There is a hand extension tgr wrote to force errors for local testing
[16:58:47] *handy
[16:59:12] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I am not sure I concur with the way this change is implemented - we're ditching a working, clearly written script for the sake of rewritin" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/228329 (owner: 10coren)
[16:59:57] bd808: I'm tailing it...but no new exceptions are coming in :(
[17:00:02] !log Started dumpwikidatajson.sh on snapshot1003 to re-create today's dump
[17:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:00:14] _joe_: The script was already becoming increasingly complicated with the overloaded functionality added, it'd just have made it worse. This way, the conditions are strictly linear, and easily parsed.
[17:00:31] <_joe_> Coren: my -1 is on the code :)
[17:00:35] (03PS1) 10Cmjohnson: Removing mac address for decom'd host analytics1009 [puppet] - 10https://gerrit.wikimedia.org/r/228852
[17:01:48] ok, found a super expensive API request
[17:01:54] (03CR) 10Legoktm: [C: 032] logging: Enable stacktrace printing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228023 (https://phabricator.wikimedia.org/T107440) (owner: 10BryanDavis)
[17:02:00] (03Merged) 10jenkins-bot: logging: Enable stacktrace printing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228023 (https://phabricator.wikimedia.org/T107440) (owner: 10BryanDavis)
[17:02:39] _joe_: Reading now. You really find nesting if/elif rather than gradually eliminating cases easier to read and understand?
[17:02:40] !log legoktm Synchronized wmf-config/logging.php: logging: Enable stacktrace printing (duration: 00m 12s)
[17:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:02:53] <_joe_> Coren: yes
[17:03:15] <_joe_> Coren: you create the main branch on what you expect, not on what you find
[17:03:23] <_joe_> it's easier not to mess up in corner cases
[17:04:22] _joe_: Huh. I find the exact opposite as a rule; that corner cases are more easily tracked if you code against actual state and not expectations.
[17:04:23] bd808: yay it's kind of better, except the stacktrace is separated by literal \n's
[17:04:43] (03CR) 10Merlijn van Deen: "Small suggestion:" [puppet] - 10https://gerrit.wikimedia.org/r/228329 (owner: 10coren)
[17:04:56] that's on purpose. did you want one giant line?
[17:04:57] <_joe_> I said you have to branch the logic on what you expect
[17:05:07] bd808: we can't have real newlines?
[17:05:08] <_joe_> not that you have to code against what you expect
[17:05:10] <_joe_> :)
[17:05:33] legoktm: oh, it's '\n' not "\n" ?
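The '\n' vs "\n" exchange above is the PHP string-quoting distinction: single-quoted '\n' is the two literal characters backslash and n, while double-quoted "\n" is an actual line feed, which is why the logged stacktrace frames ended up joined by visible \n markers on one line. The same contrast can be sketched in Python using a raw string for the literal form:

```python
newline = "\n"    # one character: an actual line feed (PHP's double-quoted "\n")
literal = r"\n"   # two characters: backslash + 'n' (PHP's single-quoted '\n')

print(len(newline))  # 1
print(len(literal))  # 2

# Joining stacktrace frames with the literal form keeps everything on one
# log line, as observed above; the frame strings here are made up.
frames = ["#0 Revision.php(123)", "#1 api.php(45)"]
print(literal.join(frames))
```

Keeping one log record per line is deliberate for logs that are grepped and shipped line-by-line; a real newline inside a record would split it across lines.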
[17:06:09] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 3.39% of data above the critical threshold [1000.0]
[17:06:17] bd808: like, http://fpaste.org/251072/62149214/raw/ is what I see
[17:06:35] _joe_: That's what I meant anyways, branch the logic on actual state and deal with your expectations in the branches. But it's two ways of looking at the same thing - I don't mind either way. I find this way clearer, but I've no religious stance on it. :-)
[17:07:13] legoktm: blerg. yeah I know why it's doing that. it's because the context is dumped as json
[17:07:28] * bd808 ponders
[17:07:38] PROBLEM - Host cr2-codfw is DOWN: CRITICAL - Network Unreachable (208.80.153.193)
[17:07:49] addshore: we now have stacktraces if you can reproduce the error
[17:08:06] legoktm: hmmm, well....
[17:08:24] * addshore has basically no idea how to reproduce that xD
[17:08:39] lol
[17:09:00] no indication at all in the logs as to what caused it / where it came from? :D
[17:09:59] PROBLEM - Router interfaces on cr1-codfw is CRITICAL host 208.80.153.192, interfaces up: 110, down: 3, dormant: 0, excluded: 0, unused: 0BRxe-5/2/0: down - cr2-codfw:xe-5/2/0 {#10695} [10Gbps DF]BRxe-5/3/0: down - cr2-codfw:xe-5/3/0 {#10696} [10Gbps DF]BRae0: down - Core: cr2-codfw:ae0BR
[17:10:39] Well, its either Language::factory() or Parser::getTargetLanguage()
[17:10:59] addshore: which bug was this?
[17:11:04] https://phabricator.wikimedia.org/T107711#1503099
[17:11:19] PROBLEM - NTP on analytics1014 is CRITICAL: NTP CRITICAL: Offset unknown
[17:12:14] addshore: oh that's a fatal, so no stacktraces at all.
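The branching-style disagreement between _joe_ and Coren above (nest if/elif around the case you expect, versus gradually eliminating cases against actual state) can be illustrated with a small hypothetical check; both variants are sketches with invented names, not code from the script under review:

```python
def classify_expected(state: str) -> str:
    # _joe_'s preference: the main branch is the case you expect,
    # with everything else in explicit corner-case branches.
    if state == "running":
        return "ok"
    elif state == "stopped":
        return "restart"
    else:
        return "investigate"

def classify_eliminate(state: str) -> str:
    # Coren's preference: eliminate cases one by one against actual
    # state, so each early return handles exactly one situation.
    if state == "running":
        return "ok"
    if state == "stopped":
        return "restart"
    return "investigate"
```

The two are behaviorally identical here; the disagreement is purely about which shape makes corner cases harder to mishandle as the function grows.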
[17:12:27] addshore: https://phabricator.wikimedia.org/T89169
[17:13:09] PROBLEM - puppet last run on cp1074 is CRITICAL Puppet has 1 failures
[17:13:20] RECOVERY - NTP on analytics1014 is OK: NTP OK: Offset -0.074447155 secs
[17:13:21] bah
[17:13:32] 6operations, 10ops-eqiad: db1035 died - network or power problem - https://phabricator.wikimedia.org/T107746#1503544 (10Cmjohnson) Jaime, This error is consistent with a failure with the system board. I will do some troubleshooting but the server is out of warranty and at the moment we do not have spare parts....
[17:14:09] well, Language::factory should always return a language... so Ill presume it isnt that one ... xD
[17:14:23] 6operations, 5Continuous-Integration-Isolation: Remove hashar and dduvall root access on to be installed labnodepool1001 - https://phabricator.wikimedia.org/T95303#1503545 (10chasemp) >>! In T95303#1492594, @hashar wrote: > labnodepool has been reinstalled from scratch. I might still need root over the next t...
[17:14:35] unless it gets into Language::$mLangObjCache
[17:14:42] legoktm@fluorine:/a/mw-log$ grep -c StubUserLang hhvm.log
[17:14:42] 37
[17:14:54] addshore: add some debug logging?
[17:15:13] will do!
[17:16:18] RECOVERY - Router interfaces on cr1-codfw is OK host 208.80.153.192, interfaces up: 116, down: 0, dormant: 0, excluded: 0, unused: 0
[17:16:19] RECOVERY - Host cr2-codfw is UP: PING OK - Packet loss = 0%, RTA = 52.74 ms
[17:17:19] PROBLEM - puppet last run on cp1049 is CRITICAL Puppet has 1 failures
[17:17:26] addshore: it only started showing up yesterday
[17:17:56] okay, how many per hour / day?
[17:18:59] 6operations, 10ops-eqiad: db1035 died - network or power problem - https://phabricator.wikimedia.org/T107746#1503579 (10jcrespo) > What is the status of decom'ing some of the older db's I will put that with higher priority, as well as some new hardware replacement. If the board is confirmed to be fried, the d...
[17:19:41] (03PS1) 10Aude: Enable usage tracking on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228856 [17:20:35] addshore: sent you a pastebin [17:20:40] cheers [17:21:04] !log switching from node 0.10 to iojs 2.5 on restbase1001 after load testing on xenon went well [17:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:21:19] addshore: a few per hour [17:21:29] !log starting kafka partition reassignment to balance all partiions over to 3 new kafka brokers and off of analytics1021 [17:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:21:35] some obviously appear to be someone repeatedly tryingi to save a page :( [17:21:43] !log legoktm Synchronized php-1.26wmf16/includes/Revision.php: https://gerrit.wikimedia.org/r/228853 (duration: 00m 12s) [17:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:21:58] aude: legoktm so add logging for these 2 core methods or 1 in Wikibase? ;) [17:22:04] and BP? :D [17:22:55] (03PS5) 10RobH: Add krinkle to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/227728 (https://phabricator.wikimedia.org/T107243) [17:23:13] addshore: dunno, whatever is more useful. We can merge it to just a wmf/ branch [17:23:41] (03CR) 10RobH: [C: 032] Add krinkle to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/227728 (https://phabricator.wikimedia.org/T107243) (owner: 10RobH) [17:25:08] robh: bringing https://phabricator.wikimedia.org/T107769 to your attention [17:25:30] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Access to analytics cluster for user krinkle - https://phabricator.wikimedia.org/T107243#1503593 (10RobH) 5Open>3Resolved No blockers after 3 day wait, this is now merged live. It will take ~30 minutes to hit the affected server(s). 
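The partition reassignment logged at [17:21:29] is driven, in Kafka of that era, by a JSON plan fed to the kafka-reassign-partitions.sh tool. A hypothetical plan moving one topic's partitions onto new broker IDs might look like the following; the topic name and broker IDs here are illustrative, not the actual cluster's:

```json
{
  "version": 1,
  "partitions": [
    {"topic": "webrequest_text", "partition": 0, "replicas": [13, 14, 20]},
    {"topic": "webrequest_text", "partition": 1, "replicas": [14, 20, 13]}
  ]
}
```

While the plan executes, followers on the new brokers must copy every partition's full log from the current leaders, which is exactly why the under-replicated-partition and replica-lag alerts that follow below are expected during a reassignment.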
[17:29:10] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 16.67% of data above the critical threshold [500.0]
[17:29:48] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1022 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 23.0
[17:30:08] PROBLEM - Kafka Broker Replica Lag on analytics1012 is CRITICAL: kafka.server.ReplicaFetcherManager.Replica-MaxLag.Value CRITICAL: 2273627062.0
[17:30:38] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1012 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 23.0
[17:30:57] hmmm, this is ok. on it.
[17:32:29] PROBLEM - Kafka Broker Replica Lag on analytics1022 is CRITICAL: kafka.server.ReplicaFetcherManager.Replica-MaxLag.Value CRITICAL: 2270083699.0
[17:32:48] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1018 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 24.0
[17:33:39] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1021 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 22.0
[17:34:11] (03PS1) 10Ottomata: Set request.required.acks to -1 for varnishkafka [puppet] - 10https://gerrit.wikimedia.org/r/228858
[17:34:18] PROBLEM - Kafka Broker Replica Lag on analytics1018 is CRITICAL: kafka.server.ReplicaFetcherManager.Replica-MaxLag.Value CRITICAL: 2267478073.0
[17:34:31] (03CR) 10Ottomata: [C: 032 V: 032] Set request.required.acks to -1 for varnishkafka [puppet] - 10https://gerrit.wikimedia.org/r/228858 (owner: 10Ottomata)
[17:34:50] 6operations, 6Labs, 3Labs-Sprint-102, 3Labs-Sprint-103, and 5 others: Reinstall labstore1001 and make sure everything is puppet-ready - https://phabricator.wikimedia.org/T107574#1503623 (10coren)
[17:35:21] (03CR) 10Aude: [C: 032] Enable usage tracking on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228856 (owner: 10Aude)
[17:35:27] (03Merged) 10jenkins-bot: Enable usage tracking on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228856 (owner: 10Aude)
[17:35:30] legoktm: aude https://gerrit.wikimedia.org/r/#/c/228859/2
[17:36:10] !log aude Synchronized usagetracking.dblist: Enable usage tracking on zhwiki (duration: 00m 12s)
[17:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:38:56] 6operations, 6Labs, 3Labs-Sprint-102, 3Labs-Sprint-103, and 4 others: labstore has multiple unpuppetized files/scripts/configs - https://phabricator.wikimedia.org/T102478#1503645 (10coren) 5Open>3Resolved T107574 contains the remaining todo
[17:39:59] PROBLEM - puppet last run on cp1056 is CRITICAL Puppet has 1 failures
[17:41:06] addshore: uh, do you want to patch the "Wikidata" extension instead? that's the one that actually gets deployed AIUI
[17:41:30] RECOVERY - salt-minion processes on analytics1014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[17:42:19] RECOVERY - puppet last run on cp1074 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:43:39] PROBLEM - puppet last run on cp1069 is CRITICAL Puppet has 1 failures
[17:44:28] RECOVERY - puppet last run on cp1049 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures
[17:44:29] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service, 3Discovery-Wikidata-Query-Service-Sprint: Assign an LVS service to the wikidata query service - https://phabricator.wikimedia.org/T107601#1503682 (10Smalyshev)
[17:45:05] 6operations, 3Labs-Sprint-104, 3Labs-Sprint-105, 3Labs-Sprint-107, 3Labs-Sprint-108: Setup/Install/Deploy labnet1002 - https://phabricator.wikimedia.org/T99701#1503684 (10Andrew)
[17:46:09] PROBLEM - Disk space on analytics1020 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%)
[17:46:16] (03PS2) 10Alexandros Kosiaris: Reorder bacula keypair key/certificate [puppet] - 10https://gerrit.wikimedia.org/r/219847
[17:47:40] PROBLEM - salt-minion processes on analytics1014 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[17:47:59] PROBLEM - Kafka Broker Server on analytics1020 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties
[17:49:03] 6operations, 6Labs, 3Labs-Sprint-107, 3Labs-Sprint-108, 3ToolLabs-Goals-Q4: Investigate kernel issues on labvirt** hosts - https://phabricator.wikimedia.org/T99738#1503690 (10Andrew)
[17:49:16] ottomata: you about?
[17:49:20] we all got that page =]
[17:52:49] andrewbogott: heyas, yer on duty! im trying to look at this kafka error
[17:52:59] but the docs on wikitech are not accurate in terms of listing jobs and the like
[17:53:06] robh: in a meeting, but I’ll be with you shortly :)
[17:53:24] well, our ops meeting is in 7 so if no one fixes by then, it'll be an all team issue
[17:53:26] heh
[17:55:18] PROBLEM - Disk space on analytics1013 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%)
[17:57:38] PROBLEM - Kafka Broker Server on analytics1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties
[17:58:49] PROBLEM - Disk space on analytics1014 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%)
[17:59:08] ottomata: are you on this?^ I'm clearing out PD if so
[18:02:47] hey, yes
[18:02:48] on it
[18:02:52] sorry was eating lunch
[18:02:57] am adding new brokers and moving partitions
[18:03:04] i think things are ok, double checking (never done this in prod before)
[18:03:09] oh
[18:03:09] PROBLEM - Kafka Broker Server on analytics1014 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties
[18:03:24] please !log in the future
[18:03:37] also see:
[18:03:37] 20:55 < icinga-wm> PROBLEM - Disk space on analytics1013 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%)
[18:03:41] 20:58 < icinga-wm> PROBLEM - Disk space on analytics1014 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%)
[18:04:16] i did log
[18:04:29] /var/spool/kafka/a & /b are not separate mountpoints
[18:04:31] yeah i see that, looks like some of the disk partitions
[18:04:32] so / is full
[18:04:32] yeah
[18:04:34] !
[18:04:35] hmmmm
[18:04:37] thanks on it
[18:05:28] RECOVERY - puppet last run on cp1056 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures
[18:05:35] ottomata: do you need me or can I switch my focus to the ops meeting?
[18:05:56] i think i got it
[18:05:57] you can do meeting
[18:05:59] ok
[18:05:59] i was going to join meeting
[18:06:02] ping me if you need me
[18:06:36] haha, the air conditioner here just tripped the breaker, it is hot and kafka is upset
[18:08:56] nice ottomata
[18:09:08] RECOVERY - puppet last run on cp1069 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures
[18:09:08] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0]
[18:09:09] 6operations, 10Analytics, 6Security: Purge > 90 days stat1002:/a/squid/archive/glam_nara - https://phabricator.wikimedia.org/T92340#1503756 (10leila) @Multichill: who from GLAM we can talk to to figure out what data needs to be kept and in what form it will be potentially useful in the future? I'm proposing...
[18:09:52] 6operations, 10RESTBase-Cassandra, 7Blocked-on-Services: Test JDK8 with Cassandra - https://phabricator.wikimedia.org/T104888#1503760 (10mobrovac)
[18:10:08] 6operations, 10RESTBase-Cassandra: Update JDK 8 package in backports repo - https://phabricator.wikimedia.org/T104887#1503761 (10mobrovac)
[18:11:09] 6operations, 10RESTBase-Cassandra, 7Blocked-on-Services: Test JDK8 with Cassandra - https://phabricator.wikimedia.org/T104888#1503763 (10GWicke) See T104887 for a discussion of what happened after downgrading to JDK7.
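The root cause identified at [18:04:29] — the Kafka data directories /var/spool/kafka/a and /b were not separate mountpoints, so the reassignment's replica copies filled / — is the kind of thing a pre-flight check can catch. A minimal Python sketch of such a check (the function name is this sketch's own):

```python
import os

def on_dedicated_mount(path: str) -> bool:
    # True only when `path` is itself a mountpoint, i.e. heavy writes
    # under it land on their own filesystem rather than filling "/".
    return os.path.ismount(path)

# "/" is by definition always a mountpoint:
print(on_dedicated_mount("/"))  # True

# A data dir that merely lives on the root filesystem would report False,
# which is exactly the situation that caused the disk-space alerts above.
```

Running something like this against each configured log.dirs entry before starting a large reassignment would have flagged the problem before the first byte was copied.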
[18:15:38] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[18:18:12] (03CR) 10Chad: Phabricator: Setup git config for all repositories (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/227488 (owner: 10Chad)
[18:23:50] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 9.09% of data above the critical threshold [500.0]
[18:26:27] 6operations, 10RESTBase-Cassandra, 6Services, 5Patch-For-Review, 7RESTBase-architecture: put new restbase servers in service - https://phabricator.wikimedia.org/T102015#1503798 (10mobrovac) >>! In T102015#1503428, @Eevans wrote: >>>! In T102015#1503250, @GWicke wrote: >> @fgiunchedi, @eevans: Should we s...
[18:28:35] !log switched restbase1002 and restbase1003 to iojs as well
[18:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:35:09] RECOVERY - Disk space on analytics1013 is OK: DISK OK
[18:35:23] (03PS1) 10Ori.livneh: Increase the log level of error logs to LOG_NOTICE [debs/nutcracker] - 10https://gerrit.wikimedia.org/r/228876
[18:35:35] paravoid: ^
[18:36:38] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[18:38:39] RECOVERY - Disk space on analytics1014 is OK: DISK OK
[18:39:00] (03CR) 10Ori.livneh: "Threw this up on Gerrit so @paravoid can see the diff; I'll update the commit message later with context/rationale." [debs/nutcracker] - 10https://gerrit.wikimedia.org/r/228876 (owner: 10Ori.livneh)
[18:39:05] 6operations, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1503834 (10Tau) 1) Do I run it via command line? `git fetch https://gerrit.wikimedia.org/r/mediawiki/core refs/changes/18/223...
[18:43:49] RECOVERY - Kafka Broker Server on analytics1013 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties
[18:44:59] RECOVERY - Disk space on analytics1020 is OK: DISK OK
[18:44:59] RECOVERY - Kafka Broker Server on analytics1014 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties
[18:46:39] PROBLEM - puppet last run on analytics1014 is CRITICAL Puppet has 1 failures
[18:46:58] RECOVERY - Kafka Broker Server on analytics1020 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties
[18:52:26] (03CR) 10Giuseppe Lavagetto: WIP: Add etcd configuration client (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/225649 (owner: 10Ori.livneh)
[18:55:30] 10Ops-Access-Requests, 6operations: Access to stat1002, stat1003, and fluorine for user bearloga - https://phabricator.wikimedia.org/T107043#1503952 (10RobH) Well, the sudo rights were approved, I'll add them later today and then we can troubleshoot the issue further. Sorry you are having issues, we'll get th...
[18:57:08] 6operations, 6Collaboration-Team, 10Collaboration-Team-Sprint-F-Finishing-Move-2015-08-04, 10Flow: Setup separate logical External Store for Flow - https://phabricator.wikimedia.org/T107610#1499219 (10Mattflaschen)
[18:57:11] 6operations, 6Collaboration-Team, 10Collaboration-Team-Sprint-F-Finishing-Move-2015-08-04, 10Flow: Setup separate logical External Store for Flow - https://phabricator.wikimedia.org/T107610#1503967 (10Mattflaschen) a:5Mattflaschen>3None
[18:57:59] (03PS1) 10BBlack: Revert "tlsproxy: let nginx use keepalives to varnish" [puppet] - 10https://gerrit.wikimedia.org/r/228882
[18:58:04] (03PS2) 10BBlack: Revert "tlsproxy: let nginx use keepalives to varnish" [puppet] - 10https://gerrit.wikimedia.org/r/228882
[18:58:11] (03CR) 10BBlack: [C: 032 V: 032] Revert "tlsproxy: let nginx use keepalives to varnish" [puppet] - 10https://gerrit.wikimedia.org/r/228882 (owner: 10BBlack)
[19:00:49] ok, now I have a problem, totally unforeseen
[19:01:15] varnishkafka is not happy with my changes
[19:01:18] (03PS1) 10Aude: Enable usage tracking on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228883 (https://phabricator.wikimedia.org/T100785)
[19:02:15] !log https://gerrit.wikimedia.org/r/228882 reversion salted + nginx reloaded
[19:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:05:58] (03PS8) 10Hoo man: Add DCAT-AP for Wikibase [puppet] - 10https://gerrit.wikimedia.org/r/219800 (https://phabricator.wikimedia.org/T103087) (owner: 10Lokal Profil)
[19:07:15] !log stopped a couple of kafka brokers. acknowledging..
[19:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:07:25] (03CR) 10Hoo man: "Rebased" [puppet] - 10https://gerrit.wikimedia.org/r/219800 (https://phabricator.wikimedia.org/T103087) (owner: 10Lokal Profil)
[19:09:29] PROBLEM - Restbase root url on restbase1003 is CRITICAL: Connection refused
[19:09:49] PROBLEM - Varnishkafka log producer on cp1068 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka
[19:09:49] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[19:10:43] (03PS1) 10Aude: Add debug log group for T107711 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228886
[19:11:08] * aude needs to deploy in a few minutes
[19:11:27] my window is taking extra long today :/
[19:11:31] (but then will be done)
[19:11:38] RECOVERY - Restbase root url on restbase1003 is OK: HTTP OK: HTTP/1.1 200 - 15149 bytes in 0.007 second response time
[19:11:59] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy
[19:12:28] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 14.29% of data above the critical threshold [500.0]
[19:13:29] (03PS2) 10JanZerebecki: Add query.wikidata.org [dns] - 10https://gerrit.wikimedia.org/r/228411 (https://phabricator.wikimedia.org/T107602)
[19:18:08] RECOVERY - Varnishkafka log producer on cp1068 is OK: PROCS OK: 1 process with command name varnishkafka
[19:19:01] (03CR) 10Aude: [C: 032] Enable usage tracking on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228883 (https://phabricator.wikimedia.org/T100785) (owner: 10Aude)
[19:19:10] (03Merged) 10jenkins-bot: Enable usage tracking on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228883 (https://phabricator.wikimedia.org/T100785) (owner: 10Aude)
[19:19:20] (03CR) 10Aude: [C: 032] Add debug log group for T107711 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228886 (owner: 10Aude)
[19:19:27] 6operations, 10Traffic, 5Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#1504078 (10BBlack) 5Resolved>3Open patch reverted, was causing an appreciable rate of 502 errors on upload.wm.o, likely due to using too many parallel sockets to varnis...
[19:19:27] (03Merged) 10jenkins-bot: Add debug log group for T107711 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228886 (owner: 10Aude)
[19:20:08] 7Puppet, 10Beta-Cluster: Puppet failures on deployment-mx: can't find puppet://private/dkim/wikimedia.org-wiki-mail.key - https://phabricator.wikimedia.org/T87848#1504088 (10mmodell)
[19:20:12] 6operations, 10Beta-Cluster: puppet fail on deployment-mx - https://phabricator.wikimedia.org/T106660#1504087 (10mmodell)
[19:20:33] !log aude Synchronized wmf-config/InitialiseSettings.php: Add debug log group for T107711 (duration: 00m 12s)
[19:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:21:22] !log aude Synchronized usagetracking.dblist: Enable usage tracking on enwiki (duration: 00m 12s)
[19:21:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:21:31] :)
[19:21:31] (03PS9) 10Hoo man: Add DCAT-AP for Wikibase [puppet] - 10https://gerrit.wikimedia.org/r/219800 (https://phabricator.wikimedia.org/T103087) (owner: 10Lokal Profil)
[19:21:59] (03CR) 10ArielGlenn: [C: 032] make xml{stubs,abstracts,logs}.py behave as dumpBackup.php does [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/225862 (owner: 10ArielGlenn)
[19:22:49] and doing one backport, while i'm on tin
[19:25:01] (03CR) 10Hoo man: "Pushing those files to /usr/local/share/dcat/ via puppet now. @Ariel: I think we're good to go here (please have a look at the puppet chan" [puppet] - 10https://gerrit.wikimedia.org/r/219800 (https://phabricator.wikimedia.org/T103087) (owner: 10Lokal Profil)
[19:28:06] (03PS1) 10Andrew Bogott: Remove hashar and dan as roots on labnodepool: [puppet] - 10https://gerrit.wikimedia.org/r/228890 (https://phabricator.wikimedia.org/T95303)
[19:28:25] 6operations, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1504142 (10Tgr) The easiest way is probably to run `curl -s 'https://git.wikimedia.org/patch/mediawiki%2Fcore/644463979255762...
[19:31:53] aude: Don't forget to push the wikidata update you just merged
[19:32:40] hoo: waiting for jenkins
[19:32:50] which just merged
[19:33:05] oh, didn't realize that
[19:33:12] * hoo hides
[19:33:27] :)
[19:35:39] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[19:36:41] jynus: are you still working on https://phabricator.wikimedia.org/T106637, or should I inherit it since I’m on clinic duty this week?
[19:38:16] (03PS3) 10Smalyshev: T105080: add maintenance mode configs for nginx [puppet] - 10https://gerrit.wikimedia.org/r/228140
[19:39:09] (03PS4) 10Smalyshev: T105080: add maintenance mode configs for nginx [puppet] - 10https://gerrit.wikimedia.org/r/228140
[19:40:07] 7Blocked-on-Operations, 6operations, 5Continuous-Integration-Isolation: Backport python-os-client-config 1.3.0-1 from Debian Sid to jessie-wikimedia - https://phabricator.wikimedia.org/T104967#1504200 (10Andrew) I'm a bit confused by this task. Is it still just that you need python-os-client-config downloa...
[19:49:06] !log aude Synchronized php-1.26wmf16/extensions/Wikidata: Fix T104609 and fix/debug T107711 (duration: 00m 19s) [19:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:49:35] (03PS2) 10Dzahn: Add missing ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/228837 (https://bugzilla.wikimedia.org/105040) (owner: 10Muehlenhoff) [19:50:20] (03CR) 10Dzahn: [C: 032] Add missing ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/228837 (https://bugzilla.wikimedia.org/105040) (owner: 10Muehlenhoff) [19:53:05] (03CR) 10Dzahn: "already done in patches by others" [puppet] - 10https://gerrit.wikimedia.org/r/205904 (https://phabricator.wikimedia.org/T104939) (owner: 10Dzahn) [19:53:55] (03Abandoned) 10Dzahn: dumps: put base::firewall on ms1001 [puppet] - 10https://gerrit.wikimedia.org/r/205904 (https://phabricator.wikimedia.org/T104939) (owner: 10Dzahn) [19:56:38] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 14.29% of data above the critical threshold [500.0] [20:00:04] gwicke cscott arlolra subbu: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150803T2000). 
[20:17:49] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [20:26:19] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 14.29% of data above the critical threshold [500.0] [20:26:37] !log updated Parsoid to version 38d0cdb13734a40bc2908e779e1a0cde158048f2 [20:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:36:59] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [20:41:56] 6operations: Replace SSH keys for jamesur - https://phabricator.wikimedia.org/T107798#1504373 (10Jalexander) 3NEW [20:42:22] (03PS2) 10Jalexander: Replace ssh key for jamesur [puppet] - 10https://gerrit.wikimedia.org/r/228597 (https://phabricator.wikimedia.org/T107798) [20:43:32] Jamesofur: I dislike the comment inconsistency :( [20:43:44] you're going to have to be more specific than that :) [20:44:01] going from MonthDDYYYY to YYYYMM [20:44:10] it's like the SAL log, weird change ;) [20:44:47] (03PS1) 10Ori.livneh: role::mediawiki: tune down nutcracker log verbosity [puppet] - 10https://gerrit.wikimedia.org/r/228976 [20:45:00] JohnFLewis: well you're in luck! You won't be able to see both at once :P only the new one :) [20:45:18] (the date is in there just so that I know when it was last changed, obviously the important part is that it's unique to the servers and nothing else) [20:45:30] (well... and other things but that's the important part of the comment for me) [20:45:33] yeah [20:45:34] https://xkcd.com/1179/ [20:45:55] your keys follow: [20:46:10] Krenair, legoktm: https://gerrit.wikimedia.org/r/#/c/228768/ ? [20:46:11] bd808: this is true, but let's be honest, I don't actually CARE if anyone else understands that date ;) [20:46:12] * Jamesofur ducks [20:46:21] bd808: I want to -1 the patch and comment that now :) [20:46:22] JohnFLewis: yup basically, my other keys follow that pattern too [20:46:24] ori: after the branch cut?
[20:46:35] ori: so it gets in tech/news [20:46:37] (03PS5) 10Dzahn: dumps: put base::firewall on dataset1001 [puppet] - 10https://gerrit.wikimedia.org/r/205903 (https://phabricator.wikimedia.org/T104939) [20:46:44] Jamesofur: nor should you really. [20:46:44] legoktm: when does that happen, again? [20:46:48] (03PS1) 10ArielGlenn: ability to skip jobs for dump runs but mark runs as complete [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/228978 [20:46:53] ori: tomorrow morning [20:46:54] (03CR) 10Ori.livneh: [C: 032] role::mediawiki: tune down nutcracker log verbosity [puppet] - 10https://gerrit.wikimedia.org/r/228976 (owner: 10Ori.livneh) [20:47:02] s/morning/sometime/ [20:47:13] oh god [20:47:20] it was deprecated in 2012 [20:47:38] it has been wrapped in mw.log.deprecate() for some years too [20:47:45] let's just kill it [20:47:58] I thought there were still gadgets using it? [20:48:07] but it is nasty cruft for sure [20:48:08] so? [20:48:12] (03PS3) 10Andrew Bogott: Replace ssh key for jamesur [puppet] - 10https://gerrit.wikimedia.org/r/228597 (https://phabricator.wikimedia.org/T107798) (owner: 10Jalexander) [20:48:17] there are people using ie6 [20:48:28] at some point your obligation to support them expires [20:48:35] ottomata: https://wikitech.wikimedia.org/wiki/Webperf It runs on hafnium [20:48:41] ori: killing it is fine, but lets at least announce that a date has been set. It's been deprecated since 2012, another day isn't going to make a big difference. [20:48:43] there are people writing dates as mm/dd/yy and dd/mm/yy! 
[20:48:53] * Nemo_bis screams [20:49:09] there are americans writing dates as mm/dd/yy and the rest of the world [20:49:16] (03CR) 10Andrew Bogott: [C: 032] Replace ssh key for jamesur [puppet] - 10https://gerrit.wikimedia.org/r/228597 (https://phabricator.wikimedia.org/T107798) (owner: 10Jalexander) [20:49:40] legoktm: blah, fine [20:50:56] (03PS6) 10Dzahn: dumps: put base::firewall on dataset1001 [puppet] - 10https://gerrit.wikimedia.org/r/205903 (https://phabricator.wikimedia.org/T104939) [20:52:19] andrewbogott: if you have time https://gerrit.wikimedia.org/r/#/c/224558/ and the DNS patch would be great reviews [20:52:20] (03CR) 10Dzahn: [C: 032] dumps: put base::firewall on dataset1001 [puppet] - 10https://gerrit.wikimedia.org/r/205903 (https://phabricator.wikimedia.org/T104939) (owner: 10Dzahn) [20:52:42] andrewbogott: it sounds like cmjohnson1 could do with some spare parts from old db servers and already has a DBA +1 [20:54:59] (03CR) 10BryanDavis: Add configuration for authmetrics logging (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227630 (https://phabricator.wikimedia.org/T91701) (owner: 10Gergő Tisza) [20:55:40] (03PS2) 10ArielGlenn: ability to skip jobs for dump runs but mark runs as complete [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/228978 [20:56:38] JohnFLewis: I will look, thanks [20:56:58] (03CR) 10ArielGlenn: [C: 032] ability to skip jobs for dump runs but mark runs as complete [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/228978 (owner: 10ArielGlenn) [20:59:40] (03PS2) 10BryanDavis: Remove code duplication from monolog config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224188 (owner: 10Gergő Tisza) [20:59:51] (03PS1) 10ArielGlenn: fix regression in chunked dump production [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/228979 [21:00:11] (03PS2) 10BBlack: bits.wm.o -> text-cluster [dns] - 10https://gerrit.wikimedia.org/r/228021 (https://phabricator.wikimedia.org/T95448) [21:00:29] (03CR) 10BBlack:
[C: 031] bits.wm.o -> text-cluster [dns] - 10https://gerrit.wikimedia.org/r/228021 (https://phabricator.wikimedia.org/T95448) (owner: 10BBlack) [21:02:45] (03PS2) 10ArielGlenn: fix regression in chunked dump production, don't fail on empty tables [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/228979 [21:03:12] (03CR) 10BryanDavis: [C: 032] Remove code duplication from monolog config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224188 (owner: 10Gergő Tisza) [21:03:18] (03Merged) 10jenkins-bot: Remove code duplication from monolog config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224188 (owner: 10Gergő Tisza) [21:03:36] (03CR) 10ArielGlenn: [C: 032] fix regression in chunked dump production, don't fail on empty tables [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/228979 (owner: 10ArielGlenn) [21:04:09] 6operations, 5Patch-For-Review: Replace SSH keys for jamesur - https://phabricator.wikimedia.org/T107798#1504425 (10Andrew) 5Open>3Resolved a:3Andrew [21:04:49] !log bd808 Synchronized wmf-config/logging.php: Remove code duplication from monolog config (Ia960203) (duration: 00m 11s) [21:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:05:01] 6operations: staged dumps tracking task - https://phabricator.wikimedia.org/T107757#1504432 (10ArielGlenn) [21:05:02] 6operations: allow dumps to be treated as 'done' even though some steps are skipped - https://phabricator.wikimedia.org/T107758#1504430 (10ArielGlenn) 5Open>3Resolved https://gerrit.wikimedia.org/r/#/c/228978/ fixes this. [21:10:04] (03CR) 10BBlack: [C: 032] bits.wm.o -> text-cluster [dns] - 10https://gerrit.wikimedia.org/r/228021 (https://phabricator.wikimedia.org/T95448) (owner: 10BBlack) [21:10:39] akosiaris, welcome back! [21:11:35] (03CR) 10Andrew Bogott: [C: 032] "Looks like there's more work to do before we can delete the role; sounds like Chris would like to decom these servers in the meantime." 
[puppet] - 10https://gerrit.wikimedia.org/r/224558 (owner: 10John F. Lewis) [21:11:41] (03PS4) 10Andrew Bogott: remove db100[2-7] from install_server and coredb [puppet] - 10https://gerrit.wikimedia.org/r/224558 (owner: 10John F. Lewis) [21:13:33] (03CR) 10Andrew Bogott: [C: 032] remove db100[2-7] from install_server and coredb [puppet] - 10https://gerrit.wikimedia.org/r/224558 (owner: 10John F. Lewis) [21:14:08] JohnFLewis: merged [21:14:16] andrewbogott: awesome [21:14:53] andrewbogott: https://gerrit.wikimedia.org/r/#/c/224560/ perhaps the DNS one too? leaves Chris the asset tags so he can strip them out from there [21:16:35] (03PS3) 10Andrew Bogott: remove db100[2-7]{.mgmt}.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/224560 (https://phabricator.wikimedia.org/T105768) (owner: 10John F. Lewis) [21:17:57] (03CR) 10Andrew Bogott: [C: 032] remove db100[2-7]{.mgmt}.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/224560 (https://phabricator.wikimedia.org/T105768) (owner: 10John F. Lewis) [21:21:07] 10Ops-Access-Requests, 6operations: Access to stat1002, stat1003, and fluorine for user bearloga - https://phabricator.wikimedia.org/T107043#1504466 (10mpopov) @RobH any updates? [21:22:02] Krinkle: thanks! [21:22:30] 6operations, 10ops-eqiad, 7Database, 5Patch-For-Review: Remove db1002-db1007 from production - https://phabricator.wikimedia.org/T105768#1504476 (10JohnLewis) a:3Cmjohnson WMF asset tags still exist in DNS. This is all yours now to do whatever you need to with these servers really. [21:22:37] cmjohnson1: ^ enjoy servers [21:22:48] awesome! thanks [21:23:15] that helps quite a bit...ottomata will be happy to know i can rack more hadoop nodes [21:23:40] robh: Hi! Can you (or someone else) help bearloga here with his access request?
[21:23:48] robh: It's tracked here: https://phabricator.wikimedia.org/T107043 [21:24:03] I updated it today [21:24:07] that i was going to get to it today [21:24:10] i just havent yet =[ [21:24:22] so, andrewbogott if you wanna jump in as clinic duty [21:24:30] Deskana: otherwise i'll try to help soon [21:24:44] robh: Thanks. bearloga's really having trouble because right now he's a data analyst with no access to any of our data. :-) [21:25:12] there are dumps! [21:25:25] what Deskana said. might be worth ottomata chipping in since he's the poor sod who had to set up access for the rest of us :) [21:27:19] robh: ok, will catch up [21:27:20] cmjohnson1: !!! :D:D:D [21:27:48] Deskana: i will be able to help tomorrow, i am a little frantic trying to fix something [21:28:04] that something is kafka as I remember so that taking priority makes sense ;p [21:28:31] Deskana: that is the best excuse for not doing any work :) [21:28:46] "Sorry, I'm a data analyst with no data" [21:29:05] Indeed, and it's also a waste of money, and I like to avoid those. [21:30:29] also it will slowly drive me insane, and ditto. [21:32:54] ok, now done with one root password thing [21:32:56] lets see [21:33:11] im going to go ahead and push all the new sudo level access for bearloga and see if perhaps it fixes all our issues. [21:33:21] since the sudo is for access on varying machines that overlap as well [21:33:35] ie: no point troubleshooting the existing issue since this is going to change things anyhow [21:34:02] robh let me know when it's worth trying again [21:34:23] andrewbogott: you can stop catching up if you still are =] [21:34:36] oh, oops [21:34:40] I had a patch and everything :) [21:35:00] 10Ops-Access-Requests, 6operations: Access to stat1002, stat1003, and fluorine for user bearloga - https://phabricator.wikimedia.org/T107043#1504534 (10RobH) working on this now [21:35:44] robh: Thank you for helping. :-) [21:36:41] bearloga: what's it like being a data-less data analyst?
:) [21:38:20] (03PS1) 10RobH: adding bearloga to restricted and statistics-admins [puppet] - 10https://gerrit.wikimedia.org/r/228991 [21:39:01] (03CR) 10RobH: [C: 032] "approved during the ops meeting today" [puppet] - 10https://gerrit.wikimedia.org/r/228991 (owner: 10RobH) [21:39:05] (03CR) 10Dzahn: [C: 031] Add query.wikidata.org [dns] - 10https://gerrit.wikimedia.org/r/228411 (https://phabricator.wikimedia.org/T107602) (owner: 10JanZerebecki) [21:42:24] bearloga: so i can see your ssh key on stat1002 [21:42:29] well, your pub key that is [21:43:13] 10Ops-Access-Requests, 6operations: Access to stat1002, stat1003, and fluorine for user bearloga - https://phabricator.wikimedia.org/T107043#1504553 (10RobH) I've merged the sudo level rights live and his pub key shows on stat1002. [21:43:34] robh still getting denied when trying to get into stat1002 :( [21:43:48] right, can you go ahead and use the -v flag [21:43:53] and then paste it onto https://phabricator.wikimedia.org/paste/create/ ? [21:44:01] ssh -v stat1002..... [21:44:11] I think its your ssh config, since your ssh key exists. 
[21:44:28] You have to route in via a bastion and all [21:45:50] Also, there are directions on how to configure your options for our cluster on https://wikitech.wikimedia.org/wiki/SSH_access [21:46:18] but, first the ssh -v paste [21:46:19] =] [21:47:34] (03CR) 10Dzahn: Add ferm rules for Logstash/Elasticsearch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/227960 (https://phabricator.wikimedia.org/T104964) (owner: 10Muehlenhoff) [21:49:37] robh https://phabricator.wikimedia.org/P1817 [21:51:04] (03PS4) 10Dzahn: Remove several dead domains from redirects [puppet] - 10https://gerrit.wikimedia.org/r/225041 (https://phabricator.wikimedia.org/T105981) (owner: 10Glaisher) [21:52:00] ok, im logging in and comparing our output [21:52:21] (03CR) 10Dzahn: [C: 032] Remove several dead domains from redirects [puppet] - 10https://gerrit.wikimedia.org/r/225041 (https://phabricator.wikimedia.org/T105981) (owner: 10Glaisher) [21:55:05] bearloga: can you paste in the rest of your config? [21:55:14] (03CR) 10Jforrester: [C: 031] "Scheduled for SWAT in the morning." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227329 (owner: 10Jforrester) [21:55:15] its referencing lines 20 and 102 [21:55:27] so im not exactly sure what logic its applying, i can see you hit bastion [21:55:33] but i dont even see it attempt to hit stat1002 [21:55:56] i can only see the server host key match for your bastion attempt [21:56:03] anybody around here could help me with diamond collector? [21:56:14] it should show those even before your key is refused, since it checks your known hosts first [21:56:25] then it checks creds, is my understanding [21:56:59] but its odd, and your key does certainly exist on both stat1002 and bast1001. 
(So it should work, you arent crazy, something is wrong) [21:57:29] the part in () typically goes unsaid, but im sure this has driven you slightly mad over the past couple of days ;D [21:58:01] you can append your config into the comments of the other paste [21:59:17] PROBLEM - Kafka Broker Server on analytics1014 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties [21:59:24] ottomata: ^ [21:59:26] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1012 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0 [21:59:36] thats me again [21:59:37] i'm on it [21:59:37] PROBLEM - Kafka Broker Server on analytics1020 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties [21:59:40] just making sure =] [21:59:42] i scheduled them for maintenance [21:59:45] robh heh, thanks. it slightly has. okay, I edited the paste to include the full config [21:59:45] time must have run out [21:59:47] awesome [22:00:01] ottomata: it happens [22:00:20] having a stressful day! [22:00:21] wooooo! [22:00:36] * Ironholds hugs ottomata [22:00:59] ori: yt?
[22:01:10] bearloga: well, it says its applying something from a line 102 [22:01:19] seems you have Users/mpopov/.ssh/config [22:01:24] and /etc/ssh_config [22:01:26] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1504586 (10awight) [22:01:32] can you tell me what the line 102 from that is (in here is fine) [22:01:37] i just want to rule it out as a potential issue [22:02:08] my /etc/ssh_config is quite short [22:02:14] so a 102 line config was a bit odd to see [22:02:33] debug1: Reading configuration data /Users/mpopov/.ssh/config [22:02:33] debug1: Reading configuration data /etc/ssh_config [22:02:33] debug1: /etc/ssh_config line 20: Applying options for * [22:02:35] debug1: /etc/ssh_config line 102: Applying options for * [22:02:51] those in particular [22:02:57] RECOVERY - Kafka Broker Replica Lag on analytics1012 is OK: kafka.server.ReplicaFetcherManager.Replica-MaxLag.Value OKAY: 0.0 [22:02:57] Lines 101-103:# XAuthLocation added by XQuartz (http://xquartz.macosforge.org) [22:02:58] Host * [22:02:58] XAuthLocation /opt/X11/bin/xauth [22:02:58] and line 20 in your other config is HostName bastion-eqiad.wmflabs.org? [22:03:04] which may be the issue then [22:03:12] hrmm [22:03:30] sorry, line 20 is Host bastlabs [22:03:45] try commenting out that entire stanza and try again [22:03:47] just for kicks [22:04:00] (unless line 20 is something else in config) [22:05:59] (03CR) 10Jalexander: [C: 031] "This can get merged now." [puppet] - 10https://gerrit.wikimedia.org/r/184637 (owner: 10Anomie) [22:07:38] (03CR) 10GWicke: "https://gerrit.wikimedia.org/r/#/c/228429/ has since been merged, and the page is protected."
[puppet] - 10https://gerrit.wikimedia.org/r/228426 (https://phabricator.wikimedia.org/T107086) (owner: 10GWicke) [22:08:35] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1504620 (10BBlack) a:3BBlack Cert is ordered, waiting for issue [22:09:59] (03CR) 10GWicke: "The listing is currently live at https://meta.wikimedia.org/w/extract2.php?template=API_listing_template; this patch is about adding the r" [puppet] - 10https://gerrit.wikimedia.org/r/228426 (https://phabricator.wikimedia.org/T107086) (owner: 10GWicke) [22:17:11] bearloga: Let me know if commenting out the stanza(s) at line 20 helps, or changes the -v output [22:18:20] robh did not help :\ i posted the new output in the comments [22:19:54] hrmm [22:20:44] ah ha! [22:20:49] bearloga: I think I found it! [22:20:50] debug1: Offering RSA public key: /Users/mpopov/.ssh/github_rsa [22:21:12] hrmm [22:21:18] actually.. it tries your main key first [22:21:26] but.... are you running a ssh session with multiple keys loaded? [22:21:27] yeah, I was gonna say [22:21:32] that is not a good practice ;D [22:21:42] but, shouldnt hurt this. [22:23:16] it is odd [22:23:21] I can see it offer the right key debug1: Offering RSA public key: /Users/mpopov/.ssh/id_rsa [22:24:07] in my output I get debug1: Server accepts key: [22:24:11] right after the offer [22:24:18] then debug1: Authentication succeeded (publickey). [22:25:43] and I get those messages too…but only for bastion-eqiad.wmflabs.org (https://phabricator.wikimedia.org/P1817#7182), just not stat1002 :\ [22:25:50] bearloga: I may have asked this before, but are you able to get into stat1003? [22:26:37] I mean, we can see you hitting bast1001 so your initial connection is good [22:26:48] its just odd how you are routing after that [22:27:24] I dont even see it hitting the stat server in your output [22:27:26] and it should [22:27:26] robh not able to get into that either.
same output [22:28:07] bearloga: oh.... im noticing this in your config [22:28:15] is your labs and production key the same? [22:28:34] because, they arent allowed to be... the doc you signed acknowledged that... (so i hope im reading this wrong) [22:28:46] if they are, its ok, we can simply make a new production key ;] [22:29:57] Also I think your config is misrouting you based on your labs items. The next step at this point is you should comment out everything in your user config except the production ONLY items [22:30:02] 7Blocked-on-Operations, 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Add Redis to maps cluster - https://phabricator.wikimedia.org/T107813#1504732 (10Yurik) 3NEW [22:30:06] comment all labs, I suspect one of them is borking it [22:30:15] since you arent hitting any internal servers past bast1001. [22:30:43] (i think your config is routing you into bast1001, and then trying to route to a labs bastion, just cuz i see no hits on internal non labs systems you are attempting to reach) [22:32:54] bearloga: also i hope the ;] indicated that having something not configured right at this point is ok, we throw a shit ton of info at you when you start [22:32:54] sorry! I'm going to generate another key for you to use. posting new public key in phabricator now. also commented out all labs (still denied) just to test your theory [22:33:09] cool [22:33:29] i'll push your key immediately if you want to do that to phab first [22:33:44] feel free to test theory since it'll take me a few minutes to make the patchset and push live [22:34:14] When you use your ssh agents in the future [22:34:15] 10Ops-Access-Requests, 6operations: Access to stat1002, stat1003, and fluorine for user bearloga - https://phabricator.wikimedia.org/T107043#1504747 (10mpopov) New pubkey to use for production (different from labs pubkey): ``` $> cat /users/mpopov/.ssh/wmf_rsa.pub ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQC27kIX...
[22:34:23] you want to load ONLY labs or production into any session [22:34:32] I dont recall how to configure this offhand in linux, only os x [22:34:48] using os x, actually :) [22:34:49] (your terminal software dictates this) [22:34:53] ahh, then thats easy! [22:34:59] you are using the stock terminal app right? [22:35:03] yup [22:35:30] preferences > Profiles > Shell > Check Run Command @ Startup: eval `ssh-agent` [22:35:35] with the run inside shell checked [22:35:45] that makes every terminal window/tab run its own silo'd ssh agent [22:35:55] so loading github in one tab doesnt share to another one with labs [22:35:59] or another one with your production key [22:36:16] once you insert, opening a new tab should work [22:36:20] you can test with ssh-add -L [22:36:27] to list off identities loaded [22:36:46] the drawback is now you have to load the key to every tab every time (which is also the benefit ;) [22:37:17] If you do not do that, someone has a greater attack vector to hijack your production ssh details when you connect to labs [22:37:30] or to anything else [22:37:47] (though it is small, and stopping key forwarding closes that gap up pretty effectively, its just good practice) [22:38:47] robh thank you! going to do this now. also, commenting out labs didn't fix. [22:39:01] ok, very strange [22:39:05] ok, im pushing your new key now [22:40:15] (03PS1) 10RobH: changing bearloga's ssh key [puppet] - 10https://gerrit.wikimedia.org/r/229009 [22:40:26] lemme get this merged, im also pondering ;] [22:40:51] (03CR) 10RobH: [C: 032] changing bearloga's ssh key [puppet] - 10https://gerrit.wikimedia.org/r/229009 (owner: 10RobH) [22:42:32] (03PS1) 10Ottomata: Remove newly provisioned kafka nodes from cluster [puppet] - 10https://gerrit.wikimedia.org/r/229012 (https://phabricator.wikimedia.org/T106581) [22:42:49] so I did that and every new tab is loading with that labs key (id_rsa). is this intended behavior or should "ssh-add -L" show me no keys loaded?
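The per-tab agent isolation robh walks through above can be sketched as the shell session below. `~/.ssh/wmf_rsa` is the production key name from this log; the sketch generates a throwaway key in its place so the commands run anywhere, and assumes a stock OpenSSH `ssh-agent`/`ssh-add`.

```shell
# Each terminal tab starts its own ssh-agent, so a key loaded in one tab
# (e.g. a labs or GitHub key) is never offered when connecting to
# production from another tab.
eval "$(ssh-agent -s)"                 # private agent for this shell only

demo_key=$(mktemp -u)                  # stand-in for ~/.ssh/wmf_rsa
ssh-keygen -t rsa -b 2048 -N '' -f "$demo_key" -q

ssh-add "$demo_key"                    # load ONLY the production key here
ssh-add -l                             # lists exactly one identity

ssh-agent -k                           # kill the agent when the tab closes
rm -f "$demo_key" "$demo_key.pub"
```

The trade-off is the one robh notes: every new tab needs its own `ssh-add`, which is also what keeps a labs session from exposing the production key.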
[22:42:51] ok, its live and im pushing puppet updates on bastion and stat1002 (so we can keep testing) [22:43:05] it should show no keys, buttttt you had labs loaded before you started [22:43:09] try quitting terminal entirely [22:43:11] and reopening [22:43:18] (not just closing the windows) [22:43:40] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1504816 (10CCogdill_WMF) Thanks for this and for resolving T107060 @BBlack. If I can get you information on changing to https sooner than 10/7, I will. > I'm still a bit worried about the timelin... [22:43:53] you are correct in that it shouldn't show any keys loaded, and ssh-add -L should show that [22:44:04] nope, still auto-loading [22:44:48] I'm sorry for being such an ops PITA [22:46:21] im sure you're not intentionally making it break ;] [22:46:40] I'm looking over your config and i just dont see why it would do that [22:46:52] unless you have some kind of odd /etc/resolv override for *.eqiad.wmnet stuff? [22:47:10] but meh.. nah, even that makes no sense [22:47:17] you resolve to bast, it then uses ITS resolv [22:47:51] lemme braindump what we've tried to task... im starting to think we need to pull someone else in for another set of eyes [22:47:54] easier if the task is updated [22:48:24] bearloga: so just to confirm while im updating, you can ssh into bast1001 directly no problem [22:48:34] but when that works, the other stuff doesnt. [22:48:43] other stuff = ssh into stat1002/stat1003 [22:48:58] robh: if you want another set of eyes on the config, hi :) [22:49:30] feel free! https://phabricator.wikimedia.org/T107043 [22:49:34] will be updated in a moment [22:50:27] bearloga: update paste with the commented out output?
i wonder what its attempting to use in config is all [22:50:37] i want to see it call the line that has the stat1002 entry on it [22:50:39] and havent seen it [22:50:44] so the config logic isnt hitting it [22:50:48] its hitting something else [22:51:40] I am 99% certain this is a config issue. [22:52:01] (03CR) 10Ottomata: [C: 032] Remove newly provisioned kafka nodes from cluster [puppet] - 10https://gerrit.wikimedia.org/r/229012 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [22:53:07] bearloga: or just see what it says its using in the output [22:53:14] and paste that line in here [22:54:35] 10Ops-Access-Requests, 6operations: Access to stat1002, stat1003, and fluorine for user bearloga - https://phabricator.wikimedia.org/T107043#1504846 (10RobH) Update: Ok, we've tried a few things, and still no joy: * updated his production key, so now its not the same as labs * P1817 has a paste of his config... [22:54:42] I updated that with next step and what we've tried [22:54:54] but yea, I'd pull your entire config out and use the 3 line simple one [22:55:04] and use full ssh command line fqdn, etc... [22:55:23] cuz i think we're just going to keep parsing down your config until we know its not it, so lets just skip that ;D [22:55:42] the fact you can hit one system but not the others past it is why i say its config [22:56:52] I'd just send you mine, but I know it has bad logic [22:57:01] as i've added in more use cases, i've found problems... [22:57:19] (its why i dumped mine out for this test and used just three lines ;) [22:57:37] robh JohnFLewis https://phabricator.wikimedia.org/P1821 [22:58:00] something is wrong that you're adding that id at load [22:58:08] i dunno what, but thats bad, your fix works [22:58:12] but eww ;D [22:58:52] bearloga: so you dont load your production key before you attempt to connect?
[22:59:06] hm [22:59:11] but yea, just ditch all that stuff and try the three line config with full connection command [22:59:22] Who's doing SWAT? I've got a patch for it but it's still merging. [23:00:01] bearloga: I may be wrong but could you add a specific 'host bast1001.wikimedia.org' stanza which has your 'user' and 'identityfile' lines? [23:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150803T2300). Please do the needful. [23:00:05] ebernhardson MaxSem: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:07] robh the one at https://wikitech.wikimedia.org/wiki/SSH_access#Production ? [23:00:10] it's fixed an issue in the past [23:00:16] * MaxSem is here [23:00:37] well, in the ticket i put an even easier one [23:00:45] Host !bast1001.wikimedia.org *.wikimedia.org *.wmnet [23:00:45] ProxyCommand ssh -a -W %h:%p bast1001.wikimedia.org [23:00:53] ditch all the other stuff [23:01:02] and have just [23:01:05] ForwardAgent no [23:01:05] Host !bast1001.wikimedia.org *.wikimedia.org *.wmnet [23:01:06] ProxyCommand ssh -a -W %h:%p bast1001.wikimedia.org [23:01:10] Or, you can put whats on that page yes [23:01:14] * ebernhardson is here too [23:01:32] but, i dont think we should troubleshoot a large config anymore [23:01:32] (03CR) 10Mattflaschen: Enable Flow on all wikis, except private and a couple special wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226954 (https://phabricator.wikimedia.org/T106562) (owner: 10Mattflaschen) [23:01:45] I think you should ditch it and adopt one of those two simpler ones for this testing.
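Written out with line breaks (IRC flattens them), the minimal production-only config robh pastes above looks like the sketch below. The `User` and `IdentityFile` lines are the optional additions discussed later in the log rather than part of robh's three-line version, and the username and key path are bearloga's from the paste.

```
# ~/.ssh/config -- minimal production-only setup from the log
ForwardAgent no

# Every production host except the bastion itself is reached by
# tunnelling through bast1001 (the ! excludes the bastion so the
# ProxyCommand does not recurse):
Host !bast1001.wikimedia.org *.wikimedia.org *.wmnet
    User bearloga
    IdentityFile ~/.ssh/wmf_rsa
    ProxyCommand ssh -a -W %h:%p bast1001.wikimedia.org
```

With this in place the connection command is simply `ssh stat1002.eqiad.wmnet`; without the `User` line, `ssh bearloga@stat1002.eqiad.wmnet` is needed, since ssh otherwise defaults to the local username (mpopov here).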
[23:01:46] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1504874 (10CaitVirtue) I would want to leave it up for one week post-event, to allow for donations that inevitably trickle in during the days following the event, but it would be fine with me to t... [23:01:57] I agree with robh [23:02:07] it's best to work from nothing to complex than complex to nothing :) [23:02:13] MaxSem: James_F looks like i'm shipping patches today [23:02:20] As soon as this took more than 30 minutes I lost the will to make your config work ;D [23:02:32] ebernhardson: Kk. [23:02:38] its also why i dont plan to add back in my crappy config [23:02:52] (03CR) 10BryanDavis: [C: 031] "There are some Puppet style things that I'm sure will be quibbled over, but in general this looks good." [puppet] - 10https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles) [23:03:09] Added two last-minute Flow SWATs. [23:03:15] * robh had some odd pattern matches for old shit, using new config off wikitech now [23:03:22] matt_flaschen: everyone sneaking in :P [23:03:45] ebernhardson, I wasted a couple minutes figuring out why I couldn't commit the submodule bump... [23:03:56] If you use the 3 line config (where you do not list off your user or identity file) you'll need to ssh-add the id file [23:03:56] (03CR) 10EBernhardson: [C: 032] Enable Flow on all wikis, except private and a couple special wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226954 (https://phabricator.wikimedia.org/T106562) (owner: 10Mattflaschen) [23:04:02] and ssh user@stat1002... [23:04:18] Has someone volunteered to do the SWAT yet? [23:04:24] matt_flaschen: i'm doing it [23:04:25] matt_flaschen: ebernhardson.
(03Merged) 10jenkins-bot: Enable Flow on all wikis, except private and a couple special wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226954 (https://phabricator.wikimedia.org/T106562) (owner: 10Mattflaschen) [23:04:32] if you use the wikitech 7 line config https://wikitech.wikimedia.org/wiki/SSH_access#Production [23:04:40] 10Ops-Access-Requests, 6operations: Access to stat1002, stat1003, and fluorine for user bearloga - https://phabricator.wikimedia.org/T107043#1504878 (10mpopov) ``` $ cat ~/.ssh/config ForwardAgent no Host !bast1001.wikimedia.org *.wikimedia.org *.wmnet ProxyCommand ssh -a -W %h:%p bast1001.wikimedia.org $ ssh... [23:04:45] then you dont have to manually load your id, it should ask i think. [23:05:08] !log ebernhardson Synchronized wmf-config/: (no message) (duration: 00m 13s) [23:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:05:24] bearloga: ssh bearloga@stat1002 [23:05:39] its likely defaulting to mpopov from users right? [23:05:58] matt_flaschen: not really sure how to sync out the deletion of flow.dblist [23:06:00] bearloga: let me write up a basic config that should work from scratch hopefully [23:06:09] the config he just appended is basic [23:06:17] it should work =] [23:06:20] its 3 lines [23:06:24] ebernhardson, might require a scap to do properly. Not going to break anything if it just sits there though. [23:06:30] robh: bah didn't see the new comment :) [23:06:34] no worries [23:06:45] i just think it has to specify the user in the ssh connect line [23:06:56] since his laptop user is mpopov, it assumes that. [23:07:05] Gah, mediawiki-phpunit-zend, you suck. [23:07:26] robh: The config from wikitech you linked to above wont work with latest openssh ;) [23:07:39] so for some reason, it's still trying to use id_rsa for this connection even when I explicitly load wmf_rsa [23:07:58] bearloga: specify IdentityFile for your wmf_rsa and the User bearloga in the stanza.
That should work 100% eitherway. then just ssh stat1002.eqiad.wmnet [23:08:03] bearloga: yea... your local ssh is fubar man [23:08:11] the fact it loads it on EVERY terminal launch is broken [23:08:51] bearloga: but you specify the user and it still fails? [23:08:54] robh idk if this laptop was used by a previous WMF employee but if it had, they went out with a bang [23:09:11] I would really, really, really hope they handed you a clean installed system [23:09:20] and not some odd partially configured system. [23:09:37] I also happen to wipe whatever laptop is handed to me and reisntall from scratch [23:10:18] bearloga: though the fact its loading a ssh key you created (not OIT) it makes me think you had to have changed something to make it load that all the time, no? ;] [23:11:30] also someone just pointed out to me that -v is good, but -vvv is better ;] [23:11:38] if you want even more info during the connection attempt. [23:12:02] hoo, what's up with the config on wikitech? [23:12:22] at some point I attempted to clear up that page and probably broke stuff along the way [23:12:34] 10Ops-Access-Requests, 6operations: Access to stat1002, stat1003, and fluorine for user bearloga - https://phabricator.wikimedia.org/T107043#1504903 (10RobH) sorry, i should have said ssh -v bearloga@stat1003.eqiad.wmnet [23:12:53] So bearloga can SSH to bast1001 right? [23:13:03] Krenair: Host bast1001.wikimedia.org … ProxyCommand none … ControlMaster auto [23:13:06] That's not enough [23:13:18] you need to ahve the User and IdentifyFile there as well [23:13:22] no you dont [23:13:25] That changed quite recently [23:13:25] not if you load it manually [23:13:33] and put it in the connection attempt line [23:13:33] You mean using -I? [23:13:40] ssh-add [23:13:43] when you load up the terminal. 
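[Editor's note: the manual key-loading workflow robh describes above (ssh-add once per terminal session, -vvv for debugging) looks roughly like the following. The key path ~/.ssh/wmf_rsa is taken from the fingerprint pasted later in the channel; the exact commands are an illustrative sketch, not a prescribed procedure.]

```
# Load the production key into the running ssh-agent for this session
ssh-add ~/.ssh/wmf_rsa

# List loaded identities to confirm the key is actually in the agent
ssh-add -l

# Connect with maximum verbosity to see which identity ssh offers and why
ssh -vvv bearloga@stat1002.eqiad.wmnet
```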
[23:14:20] 10Ops-Access-Requests, 6operations: Access to stat1002, stat1003, and fluorine for user bearloga - https://phabricator.wikimedia.org/T107043#1504905 (10mpopov) So this is my entire config file: ``` ForwardAgent no Host !bast1001.wikimedia.org *.wikimedia.org *.wmnet User bearloga ProxyCommand ssh -v -a -W %h...
[23:14:47] bearloga, robh: To be clear, the issue is that you can SSH to bast1001 but not stat100[23] ?
[23:15:02] ok
[23:15:09] is that right?
[23:15:13] Yes
[23:15:16] Yes
[23:16:06] PROBLEM - puppet last run on cp1069 is CRITICAL Puppet has 1 failures
[23:16:35] bearloga: are you loading the id ?
[23:16:38] tempted to wipe this laptop tonight and start with a fresh ssh tomorrow morning
[23:16:38] and that's true against the latest simplified config shown as the last comment on the ticket?
[23:16:41] i load mine manually when i load up
[23:17:04] !log ebernhardson Synchronized php-1.26wmf16/extensions/Flow: Bump flow submodule in swat for 1.26wmf16 (duration: 00m 14s)
[23:17:07] PROBLEM - puppet last run on eventlog1001 is CRITICAL puppet fail
[23:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:17:13] $ ssh-add -l
[23:17:13] 4096 fd:d1:27:5d:5c:be:05:09:c6:60:6e:92:69:d4:08:2d /Users/mpopov/.ssh/wmf_rsa (RSA)
[23:17:35] hmm
[23:17:44] bearloga, what command do you run to successfully ssh to bast1001?
[23:18:13] ebernhardson, are both the Flow config (all wikis) and Flow talkpage manager one deployed?
[23:18:17] bearloga, just ssh bast1001.wikimedia.org? nothing else?
[23:18:34] !log ebernhardson Synchronized php-1.26wmf16/extensions/WikimediaEvents/: Bump WikimediaEvents in SWAT for 1.26wmf16 (duration: 00m 12s)
[23:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:18:48] MaxSem: WikimediaEvents is out now too, will likely take 5-ish minutes to get through the cache
[23:18:51] matt_flaschen: both
[23:18:57] Thanks, ebernhardson
[23:19:03] weeeeee
[23:19:12] bearloga: Ok... yea... we now have simple ssh config and you manually load your key and still no joy
[23:19:14] $ ssh bearloga@bast1001.wikimedia.org = ok. $ ssh bast1001.wikimedia.org = denied
[23:19:21] Right.
[23:19:30] Change your ssh config from this line:
[23:19:32] ProxyCommand ssh -v -a -W %h:%p bast1001.wikimedia.org
[23:19:34] To this line:
[23:19:40] ProxyCommand ssh -v -a -W %h:%p bearloga@bast1001.wikimedia.org
[23:20:16] * robh is stumped but following along (intrigued)
[23:20:47] mmm, just set User bearloga ?
[23:21:05] That was done for every host except bast1001, because it's in a block with the ProxyCommand
[23:21:14] You do not want to attempt to proxy your ssh to bast1001... via bast1001. :)
[23:21:28] User bearloga already there. Krenair changed line but ssh bast1001.wikimedia.org still denied
[23:21:37] yes, that's expected
[23:21:45] but try to stat100[23] now
[23:21:46] !log ebernhardson Synchronized php-1.26wmf16/extensions/VisualEditor/: Bump visualeditor for swat in 1.26wmf16 (duration: 00m 13s)
[23:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:21:52] James_F: VE patch now out
[23:22:55] WE HAVE STAT100[23] ACCESS!!!!!!!!!!
[23:23:14] ebernhardson: Checking now.
[23:23:23] bearloga, congrats :)
[23:23:37] any other tricks or just the username thing I suggested?
[23:23:37] thank you everyone involved! :)
[23:23:42] heh, what I said earlier? :D
[23:24:50] adding User, IdentityFile, and changing to ProxyCommand ssh -v -a -W %h:%p bearloga@bast1001.wikimedia.org helped
[23:25:28] Yeah, you should add a separate User ... to your Host def. for the bast as well
[23:25:38] pushing it into the proxycommand is not as nice
[23:26:15] ebernhardson, I see the old code even with cache busted
[23:26:18] ebernhardson: Yup, looks good to me.
[23:27:06] MaxSem: :S lemme double check
[23:27:59] MaxSem: both are definitely in the staging dir, and the changes made it from /srv/mediawiki-staging to /srv/mediawiki on tin so it should be everywhere
[23:28:08] (both WikimediaEvents patches)
[23:28:32] I still see oneIn( 1000 )
[23:28:33] MaxSem: oh, cache busting with ?debug=1 doesn't work as you might think in prod. the individual assets are still cached
[23:28:46] it just doesn't bundle them together
[23:29:20] bleeeh
[23:30:56] MaxSem: just look at them age headers ;)
[23:31:07] thank you robh Krenair hoo JohnFLewis for your help
[23:31:17] 6operations: Need sudo to blazegraph on wdqs1001/1002 - https://phabricator.wikimedia.org/T107819#1504948 (10Smalyshev) 3NEW a:3Joe
[23:31:32] \o/
[23:31:32] 6operations, 3Discovery-Wikidata-Query-Service-Sprint: Need sudo to blazegraph on wdqs1001/1002 - https://phabricator.wikimedia.org/T107819#1504956 (10Smalyshev)
[23:31:35] 6operations: Need sudo to blazegraph on wdqs1001/1002 - https://phabricator.wikimedia.org/T107819#1504958 (10Krenair)
[23:31:37] bearloga: welcome. now analyse the data :D
[23:31:39] now I just gotta get my labs ssh back. will ask for help if it doesn't work.
[23:31:49] 10Ops-Access-Requests, 6operations, 3Discovery-Wikidata-Query-Service-Sprint: Need sudo to blazegraph on wdqs1001/1002 - https://phabricator.wikimedia.org/T107819#1504948 (10Krenair)
[23:33:07] Thanks for figuring out bearloga's access, everyone. :-)
[23:35:38] robh btw, I figured out the mystery of the auto-loading key. the first time a key is generated or used (unsure), it may be added to the Mac user's login keychain. if anyone has an issue like that again, they should launch Keychain Access and delete the SSH entry.
[23:35:47] ahhhh
[23:39:07] RECOVERY - puppet last run on cp1069 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures
[23:40:30] also, huge thank you to everyone for being very understanding and helpful with explanations and directions.
[23:42:16] RECOVERY - puppet last run on eventlog1001 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures
[23:43:46] 10Ops-Access-Requests, 6operations: Access to stat1002, stat1003, and fluorine for user bearloga - https://phabricator.wikimedia.org/T107043#1505021 (10mpopov) 5Open>3Resolved
[23:45:09] !log rebuilding kafka cluster
[23:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:50:00] 10Ops-Access-Requests, 6operations: Access to stat1002, stat1003, and fluorine for user bearloga - https://phabricator.wikimedia.org/T107043#1505060 (10mpopov) Huge thanks to @robh and @krenair for their help in getting this resolved.
[23:55:56] RECOVERY - Kafka Broker Replica Lag on analytics1022 is OK: kafka.server.ReplicaFetcherManager.Replica-MaxLag.Value OKAY: 0.0
[23:56:46] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1022 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0
[23:57:46] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1021 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0
[23:57:46] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1018 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0
[23:57:46] RECOVERY - Kafka Broker Replica Lag on analytics1018 is OK: kafka.server.ReplicaFetcherManager.Replica-MaxLag.Value OKAY: 0.0
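[Editor's note: for reference, the pieces that finally worked (the ProxyCommand change at 23:19:40 and bearloga's summary at 23:24:50) add up to a client config along these lines. This is a sketch assembled from the fragments pasted in the channel; the key path ~/.ssh/wmf_rsa is assumed from the ssh-add -l output, and https://wikitech.wikimedia.org/wiki/SSH_access#Production remains the authoritative version.]

```
# ~/.ssh/config sketch, reconstructed from the discussion above (not authoritative)
ForwardAgent no

# Production hosts are reached via the bastion. The inner ssh to bast1001 is
# excluded from this stanza by the ! pattern, so the username is also given
# explicitly in the ProxyCommand (the change that fixed bearloga's access):
Host !bast1001.wikimedia.org *.wikimedia.org *.wmnet
    User bearloga
    IdentityFile ~/.ssh/wmf_rsa
    ProxyCommand ssh -a -W %h:%p bearloga@bast1001.wikimedia.org

# The bastion itself connects directly: never proxy bast1001 via bast1001.
# Per Krenair's suggestion, User/IdentityFile live in a separate stanza here
# rather than only inside the ProxyCommand string:
Host bast1001.wikimedia.org
    User bearloga
    IdentityFile ~/.ssh/wmf_rsa
    ProxyCommand none
```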