[00:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150917T0000). [00:02:07] RECOVERY - puppet last run on mw1098 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [00:02:57] RECOVERY - puppet last run on mw1109 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:04:04] 6operations, 5Patch-For-Review: Change distribution in releases.wikimedia.org to "sid" or "jessie" - https://phabricator.wikimedia.org/T111225#1647662 (10GWicke) I'll need to cut a newer one, as reprepo refuses 0.4.0 as already uploaded. @cscott, ready to cut a new release? [00:04:27] RECOVERY - puppet last run on mw1105 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [00:04:56] RECOVERY - puppet last run on mw1108 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:05:17] RECOVERY - puppet last run on mw1121 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:05:47] RECOVERY - puppet last run on mw1099 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [00:06:08] RECOVERY - puppet last run on mw1113 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [00:06:47] RECOVERY - puppet last run on mw1117 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:07:08] RECOVERY - puppet last run on mw1114 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [00:07:16] RECOVERY - puppet last run on mw1100 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:07:37] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:07:38] RECOVERY - puppet last run on mw1103 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [00:08:20] 
Krenair: sorry got distracted [00:08:24] Krenair: what's the instance name? [00:08:27] RECOVERY - puppet last run on mw1126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:08:34] labs-dnsrecursor2.openstack.eqiad.wmflabs [00:08:39] you'll have to log in as root [00:09:00] (I added my key to /etc/ssh/userkeys/root already) [00:09:46] wow so it's the exact same issue [00:09:48] RECOVERY - puppet last run on mw1101 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:10:14] yuvipanda, I worked around the problem [00:10:48] Modified /etc/resolv.conf to point to the instance's own IP (since it's running a recursive dns server pointing to the real labs-recursor0) [00:11:03] haha [00:11:03] I see [00:11:04] ok [00:11:08] so that works, but wwwhhyyyy [00:13:13] oh wait, maybe I set it to go to labs-ns2 [00:13:23] (03PS1) 10Dzahn: squid: logrotate for webproxy on carbon [puppet] - 10https://gerrit.wikimedia.org/r/239009 (https://phabricator.wikimedia.org/T97119) [00:16:02] well, either way... [00:19:08] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3035_v6, cp3046_v6 [00:20:57] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK [00:21:23] 6operations, 5Patch-For-Review, 7domains: add support for wikimedia.xyz - https://phabricator.wikimedia.org/T92547#1647684 (10Slaporte) We are in the process of transferring the domain. I'll confirm or renew. [00:24:14] (03CR) 10Dduvall: [C: 032] Fix logging output from sudo_check_call [tools/scap] - 10https://gerrit.wikimedia.org/r/238858 (owner: 10Chad) [00:24:39] (03Merged) 10jenkins-bot: Fix logging output from sudo_check_call [tools/scap] - 10https://gerrit.wikimedia.org/r/238858 (owner: 10Chad) [00:28:18] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: puppet fail [00:34:47] yuvipanda, got it all working, just need to figure out how to ensure that the parent directory of my zone file is created [00:35:03] Krenair: woo! 
[00:35:06] Krenair: yeah tha's a bit of a shitshow [00:35:16] you need to ensure directory and set a require => File['parentdir'] [00:35:46] on the file itself? [00:36:16] Krenair: the require on the file, yes. the parent dir requires another file { [00:36:37] 6operations, 10Wikimedia-Mailing-lists: Maps-l: Disable or re-assign moderators - https://phabricator.wikimedia.org/T110962#1647721 (10Yurik) Wouldn't we loose all subscribers of that list? I feel that maps clients will mostly want a "notification of changes" low traffic mailing list. So how about we keep it... [00:40:07] yuvipanda, that is indeed a bit shit [00:40:22] Krenair: yup [00:40:27] no equivalent of mkdir -p [00:40:28] unfortunately [00:40:32] (03PS1) 10Dduvall: Support batch size configuration per stage [tools/scap] - 10https://gerrit.wikimedia.org/r/239016 (https://phabricator.wikimedia.org/T112841) [00:40:49] Krenair: usually the package might setup somewhere you can drop these files to [00:40:54] (03PS3) 10Alex Monk: Move *.labsdb aliases into DNS [puppet] - 10https://gerrit.wikimedia.org/r/238672 (https://phabricator.wikimedia.org/T63897) [00:41:19] (03PS4) 10Alex Monk: Move *.labsdb aliases into DNS [puppet] - 10https://gerrit.wikimedia.org/r/238672 (https://phabricator.wikimedia.org/T63897) [00:42:58] Krenair: woo! minor things - I think your data list is missing c1, c2, c3 themselves. [00:43:10] true [00:43:17] Krenair: also values for 'ensure =>' are always unquoted [00:44:38] (03PS5) 10Alex Monk: Move *.labsdb aliases into DNS [puppet] - 10https://gerrit.wikimedia.org/r/238672 (https://phabricator.wikimedia.org/T63897) [00:46:23] Krenair: woo. 
it looks good to me, but I'll wait for andrewbogott or bblack to take a look [00:48:16] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2008_v6, cp3045_v6 [00:50:06] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK [00:53:52] (03PS2) 10Dzahn: squid: logrotate for webproxy on carbon [puppet] - 10https://gerrit.wikimedia.org/r/239009 (https://phabricator.wikimedia.org/T97119) [00:55:08] RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:57:36] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp3042_v6 [00:59:17] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 60 ESP OK [01:04:46] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 59 connecting: (unnamed) not-conn: cp4006_v6 [01:06:29] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK [01:07:56] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp3047_v6 [01:09:38] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK [01:13:37] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp3034_v6 [01:14:57] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp3049_v6 [01:15:18] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 60 ESP OK [01:18:27] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK [01:21:47] 6operations, 10Wikimedia-Mailing-lists: Maps-l: Disable or re-assign moderators - https://phabricator.wikimedia.org/T110962#1647830 (10Dzahn) >>! In T110962#1647721, @Yurik wrote: > Wouldn't we loose all subscribers of that list? There are 237 on that list. Technically we could mass subscribe them to the new... 
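Editor's note on the parent-directory exchange at 00:35-00:43 above: a minimal sketch of the pattern described there, assuming made-up paths and file contents (nothing below is taken from the actual change under review). Puppet has no `mkdir -p` equivalent, so each directory level gets its own `file` resource, and the file declares a `require` on its parent. The `ensure` values are left unquoted, as suggested in the chat.

```puppet
# Hypothetical paths, for illustration only.
# The parent directory must be declared explicitly; Puppet will not
# create missing intermediate directories on its own.
file { '/etc/example-dns/zones':
    ensure => directory,
}

# The zone file requires its parent directory, so the directory is
# applied first.
file { '/etc/example-dns/zones/example.zone':
    ensure  => present,
    content => "; placeholder zone data\n",
    require => File['/etc/example-dns/zones'],
}
```

If `/etc/example-dns` itself might not exist, it would need a third `file` resource of its own. When the parent directory is already a managed `file` resource, Puppet also autorequires it, so the explicit `require` mainly documents the intent.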
[01:23:29] (PS1) Alex Monk: Update my .bashrc [puppet] - https://gerrit.wikimedia.org/r/239021 [01:23:38] PROBLEM - puppet last run on mw1049 is CRITICAL: CRITICAL: Puppet has 1 failures [01:23:46] PROBLEM - puppet last run on analytics1029 is CRITICAL: CRITICAL: Puppet has 1 failures [01:28:56] yuvipanda, it's all on labs-dnsrecursor2.openstack.eqiad.wmflabs if you want to test [01:29:12] Krenair: \o/ cool. [01:29:20] Krenair: it'll probably be merged tomorrow one way or other. [01:29:25] also don't want to merge and just leave now :D [01:29:37] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2014_v6, cp4014_v6 [01:30:34] there's local hacks there to add my key to /var/lib/git/labs/private/files/ssh/root-authorized-keys and to /var/lib/git/operations/puppet/modules/base/templates/resolv.conf.labs.erb to fix the instance's own DNS [01:31:28] PROBLEM - IPsec on cp1060 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp4020_v6 [01:32:16] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp3033_v6 [01:32:37] operations, Wikimedia-Mailing-lists: Change Mailman master password - https://phabricator.wikimedia.org/T110949#1647833 (Dzahn) @Jalexander 2 more days.. 
[01:33:57] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK [01:34:03] (03CR) 10Dzahn: [C: 032] Update my .bashrc [puppet] - 10https://gerrit.wikimedia.org/r/239021 (owner: 10Alex Monk) [01:34:58] RECOVERY - IPsec on cp1060 is OK: Strongswan OK - 24 ESP OK [01:36:48] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK [01:36:48] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp4007_v6 [01:38:37] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 60 ESP OK [01:39:03] (03PS2) 10Dzahn: Add MAC address entries for restbase200[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/238859 (https://phabricator.wikimedia.org/T112683) (owner: 10Papaul) [01:39:17] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp3038_v6 [01:39:57] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp2020_v6 [01:41:06] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK [01:43:28] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK [01:45:16] PROBLEM - IPsec on cp1047 is CRITICAL: Strongswan CRITICAL - ok: 23 connecting: (unnamed) not-conn: cp2003_v6 [01:45:23] (03PS3) 10Dzahn: Add MAC address entries for restbase200[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/238859 (https://phabricator.wikimedia.org/T112683) (owner: 10Papaul) [01:45:47] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 59 connecting: (unnamed) not-conn: cp2008_v6 [01:45:50] (03CR) 10Dzahn: [C: 032] "amended to fix trailing whitespace, actually checked all MACs on consoles" [puppet] - 10https://gerrit.wikimedia.org/r/238859 (https://phabricator.wikimedia.org/T112683) (owner: 10Papaul) [01:46:57] RECOVERY - IPsec on cp1047 is OK: Strongswan OK - 24 ESP OK [01:48:44] Why are our DC cities so similarly named? 
Dallas and Dulles [01:48:56] (nearby anyhow) [01:49:06] (CR) Dzahn: "maybe all the links and comments that are now on this gerrit change could be copied over to a phabricator pastebin and then link that in t" [mediawiki-config] - https://gerrit.wikimedia.org/r/237960 (owner: Kaldari) [01:50:27] RECOVERY - puppet last run on mw1049 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [01:50:57] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 58 connecting: (unnamed) not-conn: cp3036_v6, cp4014_v6 [01:52:47] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK [01:53:27] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3034_v6, cp3047_v6 [01:56:17] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 60 ESP OK [01:56:57] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK [01:58:13] operations, ops-codfw, Patch-For-Review: setup/install/deploy new HP restbase servers for codfw - https://phabricator.wikimedia.org/T112683#1647861 (Papaul) Thank you Rob I will start to work on production DNS now. 
[02:05:57] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2005_v6, cp3049_v6 [02:09:27] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK [02:14:15] (03PS1) 10Dzahn: let non-root users also use bast4001 [puppet] - 10https://gerrit.wikimedia.org/r/239023 [02:18:36] RECOVERY - Hadoop DataNode on analytics1029 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [02:22:26] PROBLEM - IPsec on cp1047 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp2009_v6 [02:22:27] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 59 connecting: (unnamed) not-conn: cp3035_v6 [02:23:47] PROBLEM - Hadoop DataNode on analytics1029 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [02:24:07] RECOVERY - IPsec on cp1047 is OK: Strongswan OK - 24 ESP OK [02:24:37] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 59 connecting: (unnamed) not-conn: cp3039_v6 [02:25:57] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - ok: 57 not-conn: cp4008_v6 [02:25:57] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK [02:26:26] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK [02:27:07] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp3042_v6 [02:28:46] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [02:29:37] RECOVERY - IPsec on cp1067 is OK: Strongswan OK - 58 ESP OK [02:30:36] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10693 bytes in 1.079 second response time [02:32:02] (03PS1) 10Papaul: Add production DNS for restbase200[1-6] Bug:T112683 [dns] - 10https://gerrit.wikimedia.org/r/239024 (https://phabricator.wikimedia.org/T112683) [02:33:17] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp3044_v6 [02:36:48] 
RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK [02:37:56] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK [02:38:56] PROBLEM - IPsec on cp4012 is CRITICAL: Strongswan CRITICAL - ok: 7 connecting: cp1047_v6 [02:39:04] !log l10nupdate@tin Synchronized php-1.26wmf22/cache/l10n: l10nupdate for 1.26wmf22 (duration: 11m 11s) [02:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:40:39] RECOVERY - IPsec on cp4012 is OK: Strongswan OK - 8 ESP OK [02:40:46] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 57 not-conn: cp2017_v6, cp2020_v6, cp4006_v6 [02:45:48] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf22) at 2015-09-17 02:45:48+00:00 [02:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:47:46] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK [02:47:46] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp3037_v6 [02:50:08] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [02:51:56] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10693 bytes in 0.079 second response time [02:53:48] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2014_v6, cp4015_v6 [02:54:57] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp3038_v6 [02:55:37] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK [02:56:46] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 60 ESP OK [02:58:28] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK [03:03:21] !log l10nupdate@tin Synchronized php-1.26wmf23/cache/l10n: l10nupdate for 1.26wmf23 (duration: 06m 30s) [03:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:05:06] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - 
/run/lock/storage-replicate-labstore-others/snapshot is not accessible: Permission denied [03:05:46] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 59 no-child-sa: cp3039_v6 [03:06:33] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf23) at 2015-09-17 03:06:33+00:00 [03:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:07:27] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 60 ESP OK [03:22:37] RECOVERY - Disk space on labstore1002 is OK: DISK OK [03:28:07] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp3047_v6 [03:29:58] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK [03:31:35] operations, Wikimedia-Site-Requests: Move the wiki of WMEE - https://phabricator.wikimedia.org/T31919#1647882 (Keegan) I've emailed someone from Wikimedia Estonia as well, a little over a week ago or so. [03:40:53] operations, Wikimedia-Site-Requests: Move the wiki of WMEE - https://phabricator.wikimedia.org/T31919#1647893 (Kaarel_Vaidla) All Wikimedia Eesti has been busy with organizing Wikimedia CEE Meeting 2015 (https://meta.wikimedia.org/wiki/Wikimedia_CEE_Meeting_2015) in recent times and therefore there has be... 
[03:58:48] PROBLEM - Apache HTTP on mw1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1517 bytes in 0.046 second response time [04:00:15] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Move the wiki of WMEE - https://phabricator.wikimedia.org/T31919#1647902 (10Krenair) 5stalled>3Open a:3Krenair [04:00:18] 6operations, 10Wikimedia-Site-Requests, 7I18n, 7Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#1647905 (10Krenair) [04:00:33] (03Restored) 10Alex Monk: Create ee.wikimedia.org for renaming from et.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/234426 (https://phabricator.wikimedia.org/T31919) (owner: 10Alex Monk) [04:00:37] (03Restored) 10Alex Monk: Add ee.wikimedia.org to apache config for chapters [puppet] - 10https://gerrit.wikimedia.org/r/234427 (https://phabricator.wikimedia.org/T31919) (owner: 10Alex Monk) [04:00:47] RECOVERY - Apache HTTP on mw1017 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.047 second response time [04:04:47] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-maps/snapshot is not accessible: Permission denied [04:09:37] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1015.67 Read Requests/Sec=788.04 Write Requests/Sec=512.96 KBytes Read/Sec=6966.83 KBytes_Written/Sec=7231.36 [04:11:18] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=3.00 Read Requests/Sec=0.00 Write Requests/Sec=0.40 KBytes Read/Sec=0.00 KBytes_Written/Sec=1.60 [04:20:38] RECOVERY - puppet last run on analytics1029 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [04:25:30] (03CR) 10Santhosh: [C: 031] CX: Enable Suggestions in ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238097 (https://phabricator.wikimedia.org/T111901) (owner: 10KartikMistry) [04:53:47] PROBLEM - puppet last run on analytics1029 is 
CRITICAL: CRITICAL: Puppet has 1 failures [05:02:36] (03PS2) 10Dzahn: let non-root users also use bast4001 [puppet] - 10https://gerrit.wikimedia.org/r/239023 [05:04:39] (03PS3) 10Dzahn: let non-root users also use bast4001 [puppet] - 10https://gerrit.wikimedia.org/r/239023 [05:06:42] (03PS4) 10Dzahn: let non-root users also use bast4001 [puppet] - 10https://gerrit.wikimedia.org/r/239023 [05:10:47] (03PS1) 1020after4: A context manager for managing nested loggers [tools/scap] - 10https://gerrit.wikimedia.org/r/239028 [05:11:05] (03CR) 10jenkins-bot: [V: 04-1] A context manager for managing nested loggers [tools/scap] - 10https://gerrit.wikimedia.org/r/239028 (owner: 1020after4) [05:12:12] (03PS3) 10Dzahn: Create ee.wikimedia.org for renaming from et.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/234426 (https://phabricator.wikimedia.org/T31919) (owner: 10Alex Monk) [05:13:51] (03CR) 10Dzahn: [C: 031] "per https://phabricator.wikimedia.org/T31919#1647893" [dns] - 10https://gerrit.wikimedia.org/r/234426 (https://phabricator.wikimedia.org/T31919) (owner: 10Alex Monk) [05:18:15] (03PS2) 1020after4: A context manager for managing nested loggers [tools/scap] - 10https://gerrit.wikimedia.org/r/239028 [05:20:00] 6operations, 5Patch-For-Review, 7domains: add support for wikimedia.xyz - https://phabricator.wikimedia.org/T92547#1648000 (10Dzahn) @slaporte thanks! let's talk some time about which domains we should add to our own DNS or not. we can add domains as "parked" domains that don't get any traffic if that's desi... 
[05:20:58] 6operations, 7domains: add support for wikimedia.xyz - https://phabricator.wikimedia.org/T92547#1648001 (10Dzahn) [05:31:59] (03CR) 10Luke081515: [C: 031] Create ee.wikimedia.org for renaming from et.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/234426 (https://phabricator.wikimedia.org/T31919) (owner: 10Alex Monk) [05:32:42] (03CR) 10Luke081515: [C: 031] Add ee.wikimedia.org to apache config for chapters [puppet] - 10https://gerrit.wikimedia.org/r/234427 (https://phabricator.wikimedia.org/T31919) (owner: 10Alex Monk) [05:38:47] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds [05:47:27] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 1 below the confidence bounds [05:47:47] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Sep 17 05:47:47 UTC 2015 (duration 47m 46s) [05:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:57:23] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] 3.6.5+dfsg1-1+wm[3..5]: backport of updates for D40473 [debs/hhvm] - 10https://gerrit.wikimedia.org/r/238787 (owner: 10Ori.livneh) [06:04:43] (03PS2) 10Giuseppe Lavagetto: hhvm (3.6.5+dfsg1-1+wm6) urgency=medium [debs/hhvm] - 10https://gerrit.wikimedia.org/r/238408 (https://phabricator.wikimedia.org/T112640) [06:17:47] (03PS2) 10Kaldari: Adding comment on disabling anon page creation on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/237960 [06:18:46] RECOVERY - Hadoop DataNode on analytics1029 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [06:23:56] PROBLEM - Hadoop DataNode on analytics1029 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [06:25:49] (03PS1) 10Muehlenhoff: Enable ferm on mw1209-mw1220 
[puppet] - 10https://gerrit.wikimedia.org/r/239029 [06:29:38] PROBLEM - puppet last run on mw1073 is CRITICAL: CRITICAL: puppet fail [06:30:17] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:18] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:38] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:37] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:58] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:58] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:07] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:07] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:08] PROBLEM - puppet last run on mw2095 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:56] PROBLEM - puppet last run on mw1172 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:56] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:57] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:07] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:17] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 2 failures [06:33:38] !log depooling mw1209-mw1220 (in two steps) [06:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:36:10] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on mw1209-mw1220 [puppet] - 10https://gerrit.wikimedia.org/r/239029 (owner: 10Muehlenhoff) [06:38:16] PROBLEM - puppet last run on mw1209 is CRITICAL: Timeout while attempting connection [06:39:48] RECOVERY - puppet last run on mw1209 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:39:56] PROBLEM - puppet last run on mw1212 is 
CRITICAL: Timeout while attempting connection [06:39:56] PROBLEM - puppet last run on mw1218 is CRITICAL: Timeout while attempting connection [06:39:56] PROBLEM - puppet last run on mw1216 is CRITICAL: Timeout while attempting connection [06:40:16] PROBLEM - puppet last run on mw1211 is CRITICAL: Timeout while attempting connection [06:41:28] RECOVERY - puppet last run on mw1212 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:41:36] RECOVERY - puppet last run on mw1218 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:41:36] RECOVERY - puppet last run on mw1216 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:41:56] RECOVERY - puppet last run on mw1211 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:42:06] PROBLEM - puppet last run on mw1220 is CRITICAL: Timeout while attempting connection [06:43:07] PROBLEM - Apache HTTP on mw1220 is CRITICAL: Connection timed out [06:43:26] PROBLEM - HHVM rendering on mw1220 is CRITICAL: Connection timed out [06:43:46] RECOVERY - puppet last run on mw1220 is OK: OK: Puppet is currently enabled, last run 18 minutes ago with 0 failures [06:44:26] (03PS3) 10DCausse: Cirrus: set /langdetect/short-text/ the default langdetect profile [puppet] - 10https://gerrit.wikimedia.org/r/234297 (https://phabricator.wikimedia.org/T110077) [06:44:46] RECOVERY - Apache HTTP on mw1220 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.029 second response time [06:45:06] RECOVERY - HHVM rendering on mw1220 is OK: HTTP OK: HTTP/1.1 200 OK - 64675 bytes in 0.097 second response time [06:45:09] !log repooled mw1209-mw1220 with ferm enabled [06:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:51:02] _joe_: morning, I need to deploy config changes to elastic, but I just realized that I can't +2 on puppet [06:51:34] would you mind having a look and +2 to 
https://gerrit.wikimedia.org/r/#/c/224651/ and https://gerrit.wikimedia.org/r/#/c/234297/ ? [06:52:46] <_joe_> dcausse: I can take a look, yes [06:52:56] (PS1) Muehlenhoff: Enable ferm on mw1161-1169 [puppet] - https://gerrit.wikimedia.org/r/239032 [06:52:57] thanks [06:53:00] operations, Swift: upload.wikimedia.org needs a Wikimedia 404 error page - https://phabricator.wikimedia.org/T37053#1648120 (faidon) [06:53:51] <_joe_> dcausse: 224651 still has your -1 [06:53:55] (CR) DCausse: [C: 1] Disable dynamic scripting in Elasticsearch [puppet] - https://gerrit.wikimedia.org/r/224651 (owner: Manybubbles) [06:54:04] _joe_: done, sorry [06:55:38] <_joe_> dcausse: why removing all the groovy sandboxing options here? https://gerrit.wikimedia.org/r/#/c/224651/8/modules/elasticsearch/templates/elasticsearch.yml.erb [06:56:27] operations, Swift: upload.wikimedia.org needs a Wikimedia 404 error page - https://phabricator.wikimedia.org/T37053#1648144 (faidon) upload requests are being (ultimately) served by Swift, no nginx involved in that layer. Swift doesn't have a mechanism to provide an error page, however it should be possibl... 
[06:56:33] _joe_: because we'll remove groovy support so it's useless now [06:56:38] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [06:56:39] <_joe_> oh it's related to dynamic scripting, right [06:56:46] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:56:47] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:56:47] <_joe_> yeah I realized it a second later, sorry [06:56:48] <_joe_> :P [06:56:52] :) [06:57:01] !log depooled mw1161-1168 (T104968) [06:57:07] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [06:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:57:21] (03PS9) 10Giuseppe Lavagetto: Disable dynamic scripting in Elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/224651 (owner: 10Manybubbles) [06:57:27] <_joe_> dcausse: ok, let's go [06:57:27] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [06:57:36] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:57:37] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [06:57:48] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:57:48] RECOVERY - puppet last run on mw1073 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:57:57] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:18] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:19] RECOVERY - 
puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:19] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on mw1161-1169 [puppet] - 10https://gerrit.wikimedia.org/r/239032 (owner: 10Muehlenhoff) [06:58:26] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:27] RECOVERY - puppet last run on mw2095 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [06:58:59] (03CR) 10Giuseppe Lavagetto: [C: 032] Disable dynamic scripting in Elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/224651 (owner: 10Manybubbles) [06:59:07] RECOVERY - puppet last run on mw1172 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:08] (03PS10) 10Giuseppe Lavagetto: Disable dynamic scripting in Elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/224651 (owner: 10Manybubbles) [06:59:08] finally [06:59:37] (03CR) 10Giuseppe Lavagetto: [V: 032] Disable dynamic scripting in Elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/224651 (owner: 10Manybubbles) [06:59:57] <_joe_> dcausse: this won't auto-restart ES, right? [07:00:04] dcausse: Dear anthropoid, the time has come. Please deploy ElasticSearch plugins upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150917T0700). [07:00:12] _joe_: no [07:00:46] (03CR) 10DCausse: [C: 032 V: 032] Cirrus: add language detector plugin [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/234283 (https://phabricator.wikimedia.org/T110077) (owner: 10DCausse) [07:01:17] <_joe_> dcausse: you want both the patches toghether I guess, then do a rolling restart, right? 
[07:01:28] _joe_: yes [07:01:32] <_joe_> wait ~ 30 mins for puppet changes to take effect though [07:01:48] (PS4) Giuseppe Lavagetto: Cirrus: set /langdetect/short-text/ the default langdetect profile [puppet] - https://gerrit.wikimedia.org/r/234297 (https://phabricator.wikimedia.org/T110077) (owner: DCausse) [07:01:58] (CR) Giuseppe Lavagetto: [C: 2] Cirrus: set /langdetect/short-text/ the default langdetect profile [puppet] - https://gerrit.wikimedia.org/r/234297 (https://phabricator.wikimedia.org/T110077) (owner: DCausse) [07:02:22] _joe_: thanks! [07:02:56] (CR) DCausse: [C: 2 V: 2] Upgrade to extra plugin 1.7.1 [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/238105 (https://phabricator.wikimedia.org/T112499) (owner: DCausse) [07:05:07] operations, ops-eqiad, Traffic, netops, Patch-For-Review: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1648168 (faidon) Are these for the "high-traffic" LVSes? If so, yes, no objections. JFYI, these are our last row B 10G ports. 
[07:07:25] !log repooled mw1161-1168 (T104968) [07:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:14:56] <_joe_> !log uploading new HHVM package [07:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:15:52] (03PS1) 10Filippo Giunchedi: thumbstats: add singleton Filter docs [software] - 10https://gerrit.wikimedia.org/r/239036 [07:16:12] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] thumbstats: add singleton Filter docs [software] - 10https://gerrit.wikimedia.org/r/239036 (owner: 10Filippo Giunchedi) [07:21:47] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds [07:24:57] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 9.09% of data above the critical threshold [500.0] [07:25:09] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Move the wiki of WMEE - https://phabricator.wikimedia.org/T31919#1648191 (10Keegan) >>! In T31919#1647893, @Kaarel_Vaidla wrote: > All Wikimedia Eesti has been busy with organizing Wikimedia CEE Meeting 2015 (https://meta.wikimedia.org/wiki/Wikimedia... 
[07:25:26] Krenair: yay [07:26:41] (03PS1) 10Muehlenhoff: Enable ferm on mw1170-mw1179 [puppet] - 10https://gerrit.wikimedia.org/r/239038 [07:27:07] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 5 below the confidence bounds [07:27:44] !log depooled mw1170-mw1179 (T104968) [07:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:28:20] (03Abandoned) 10Filippo Giunchedi: partially enable outbound SMTP STARTTLS support [puppet] - 10https://gerrit.wikimedia.org/r/160632 (owner: 10Filippo Giunchedi) [07:29:16] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on mw1170-mw1179 [puppet] - 10https://gerrit.wikimedia.org/r/239038 (owner: 10Muehlenhoff) [07:30:17] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [07:30:47] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 2 below the confidence bounds [07:36:08] !log elastic in eqiad plugin updates: freezing indices [07:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:36:57] !log repooled mw1170-mw1179 (T104968) [07:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:37:26] good morning [07:41:13] (03PS1) 10Muehlenhoff: Enable ferm on mw1180-mw1188 [puppet] - 10https://gerrit.wikimedia.org/r/239039 [07:42:03] !log depooled mw1180-mw1188 (T104968) [07:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:42:28] !log elastic in eqiad plugin updates: restarting elastic1001 [07:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:43:11] (03CR) 10Hashar: "recheck" [software/conftool] - 10https://gerrit.wikimedia.org/r/226910 (owner: 10Hashar) [07:44:27] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on mw1180-mw1188 [puppet] - 
10https://gerrit.wikimedia.org/r/239039 (owner: 10Muehlenhoff) [07:48:25] (03PS3) 10Addshore: Rsync api log archives from fluorine to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/238798 (https://phabricator.wikimedia.org/T112744) [07:48:30] (03PS4) 10Addshore: Rsync api log archives from fluorine to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/238798 (https://phabricator.wikimedia.org/T112744) [07:49:51] !log repooled mw1180-mw1188 (T104968) [07:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:51:05] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: puppet fail [07:54:10] 6operations, 10ops-codfw: ms-be2006.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T112242#1648212 (10fgiunchedi) 5Open>3Resolved rebuilding [08:00:09] (03PS1) 10Muehlenhoff: Enable ferm on mw1221-mw1229 [puppet] - 10https://gerrit.wikimedia.org/r/239041 [08:00:11] (03PS1) 10Muehlenhoff: Enable ferm on mw1230-mw1235 [puppet] - 10https://gerrit.wikimedia.org/r/239042 [08:05:12] !log eqiad-codfw -> eqiad-eqord-codfw migration [08:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:09:55] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 214, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/2/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-307236) (will be: cr1-eqord:xe-1/0/0, IC-314533) {#3658} [10Gbps DWDM] [08:12:58] (03PS3) 1020after4: A context manager for managing nested loggers [tools/scap] - 10https://gerrit.wikimedia.org/r/239028 [08:17:05] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 216, down: 0, dormant: 0, excluded: 0, unused: 0 [08:18:16] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [08:20:59] (03PS1) 10Zfilipin: WIP Extract colordiff to contint::packages::colordiff [puppet] - 
10https://gerrit.wikimedia.org/r/239044 (https://phabricator.wikimedia.org/T112821) [08:38:24] someone is aware of the problem on test.m.wikipedia.org? PoolCounterClient_body.php","line":71,"message":"PHP Warning: unable to connect to 10.64.0.179 ? [08:39:16] PROBLEM - Host rigel is DOWN: PING CRITICAL - Packet loss = 100% [08:39:50] hm [08:40:15] RECOVERY - Host rigel is UP: PING OK - Packet loss = 0%, RTA = 39.06 ms [08:40:21] rigel ? [08:40:28] and recovery ? [08:40:28] frack [08:40:35] I'm flapping eqiad-codfw [08:40:41] the pfw ipsec tunnel probably got unhappy [08:40:55] dcausse: I think poolcounter was added in test just yesterday [08:41:05] so people are probably aware [08:41:20] ok [08:45:06] PROBLEM - puppet last run on mw2147 is CRITICAL: CRITICAL: Puppet has 1 failures [08:46:46] <_joe_> dcausse: yes it's me sorry [08:46:52] <_joe_> I didn't see your message [08:47:04] <_joe_> dcausse: you're seeing that in logstash, right? [08:47:12] <_joe_> I'll stop it now [08:47:20] no it's a ticket from the mobile team [08:47:33] <_joe_> uhm can you point me to it? [08:47:38] (03PS1) 10Faidon Liambotis: Change interface in PTRs for eqord-eqiad link [dns] - 10https://gerrit.wikimedia.org/r/239046 [08:47:40] T112834 [08:47:52] <_joe_> thanks [08:47:56] (03CR) 10Faidon Liambotis: [C: 032] Change interface in PTRs for eqord-eqiad link [dns] - 10https://gerrit.wikimedia.org/r/239046 (owner: 10Faidon Liambotis) [08:48:16] (03CR) 10Hashar: "It is too narrow. Maybe we can get the very basic utilities in a contint::packages::base that will be applied on any slaves." [puppet] - 10https://gerrit.wikimedia.org/r/239044 (https://phabricator.wikimedia.org/T112821) (owner: 10Zfilipin) [08:49:10] <_joe_> dcausse: also, I'd expect it to connect to the next server [08:50:33] _joe_: I don't know how test.m is configured, is it supposed to access elastic in beta? 
[08:52:08] !log elastic in eqiad plugin updates: index warmer queries are outdated with inline groovy script, updating warmers on warwiki first to test [08:52:16] <_joe_> dcausse: no, it's completely running on prod [08:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:53:40] <_joe_> !log removed iptables rules for dropping traffic to helium on mw1017 [08:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:54:05] _joe_: thanks, working now [08:54:45] <_joe_> dcausse: yes, I still don't understand why failure to connect to one host should result in a failure, instead of retrying on the other one. I'll take a better look at the code [08:55:03] <_joe_> because my understanding was that it would be retried on the next server instead. [08:55:39] yes, I agree [08:56:57] (03PS2) 10Filippo Giunchedi: Add production DNS for restbase200[1-6] Bug:T112683 [dns] - 10https://gerrit.wikimedia.org/r/239024 (https://phabricator.wikimedia.org/T112683) (owner: 10Papaul) [08:57:14] (03PS3) 10Filippo Giunchedi: Add production DNS for restbase200[1-6] [dns] - 10https://gerrit.wikimedia.org/r/239024 (https://phabricator.wikimedia.org/T112683) (owner: 10Papaul) [08:57:25] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Add production DNS for restbase200[1-6] [dns] - 10https://gerrit.wikimedia.org/r/239024 (https://phabricator.wikimedia.org/T112683) (owner: 10Papaul) [08:57:55] !log adjusting OSPF weights to be latency-based across the US network [08:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:58:18] !log penalizing ulsfo-eqiad direct MPLS links to higher OSPF weights [08:58:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:01:19] !log elastic in eqiad plugin updates: updating warmers on all wikis [09:01:26] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [09:01:28] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:09:33] (03PS2) 10Zfilipin: nodepool: extract basic utilites needed for all Jenkins slaves into a separate manifest [puppet] - 10https://gerrit.wikimedia.org/r/239044 (https://phabricator.wikimedia.org/T112821) [09:10:05] RECOVERY - puppet last run on mw2147 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [09:10:12] (03PS3) 10Zfilipin: contint: extract basic utilities needed for all Jenkins slaves into a separate manifest [puppet] - 10https://gerrit.wikimedia.org/r/239044 (https://phabricator.wikimedia.org/T112821) [09:12:48] any ops has some spare cycles to please merge in a few contint related puppet patches? They are all cherry picked on the labs puppet master already [09:13:22] zeljkof: sounds good [09:13:49] hashar: yeah! :D [09:14:12] small step for a man, but big for a puppet, or something... [09:14:31] <_joe_> hashar: what about waiting for puppetswat? [09:16:01] cause it is at the end of the day and we start having conflicts ? :D [09:16:09] though we could chain them all as dependent patches [09:16:20] and puppet swat the whole dep chain [09:16:26] <_joe_> that's what people usually do [09:16:44] <_joe_> if you put them in the deployments calendar, I will take a look after lunch [09:17:45] zeljkof: I am going to build a dependency chain of all the pending puppet patches we have, this way we can keep working on the tip of that chain and avoid conflict [09:17:48] _joe_: thank you! [09:18:04] hashar: sounds good [09:18:13] <_joe_> hashar: no guarantees on the result of my reviews though ;) [09:19:18] _joe_: I am not too worried :} [09:21:27] (03CR) 10Hashar: contint: remove obsolete ruby related packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/238436 (owner: 10Hashar) [09:23:17] jynus: can https://phabricator.wikimedia.org/T111455 be added to next deployment window? 
[09:23:18] (03CR) 10Zfilipin: contint: remove obsolete ruby related packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/238436 (owner: 10Hashar) [09:23:25] (03PS1) 10Filippo Giunchedi: install_server: add restbase200[1-6] to netboot [puppet] - 10https://gerrit.wikimedia.org/r/239051 (https://phabricator.wikimedia.org/T112683) [09:24:29] mafk, I do not handle deployments [09:24:42] RelEng does [09:24:47] jynus: ok, sorry [09:25:56] (03PS5) 10Hashar: contint: Install chromedriver for running MW-Selenium tests [puppet] - 10https://gerrit.wikimedia.org/r/223691 (https://phabricator.wikimedia.org/T103039) (owner: 10Dduvall) [09:25:58] (03PS3) 10Hashar: contint: upgrade setuptools from pypi [puppet] - 10https://gerrit.wikimedia.org/r/234254 (https://phabricator.wikimedia.org/T110506) [09:26:00] (03PS2) 10Hashar: contint: remove subversion::client [puppet] - 10https://gerrit.wikimedia.org/r/238442 [09:26:02] (03PS2) 10Hashar: contint: remove obsolete ruby related packages [puppet] - 10https://gerrit.wikimedia.org/r/238436 [09:26:04] (03PS2) 10Hashar: contint: remove postgresql [puppet] - 10https://gerrit.wikimedia.org/r/238438 [09:26:06] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] install_server: add restbase200[1-6] to netboot [puppet] - 10https://gerrit.wikimedia.org/r/239051 (https://phabricator.wikimedia.org/T112683) (owner: 10Filippo Giunchedi) [09:28:17] (03PS4) 10Hashar: contint: extract basic utilities needed for all Jenkins slaves into a separate manifest [puppet] - 10https://gerrit.wikimedia.org/r/239044 (https://phabricator.wikimedia.org/T112821) (owner: 10Zfilipin) [09:32:47] (03PS1) 10Hoo man: Set 'repoConceptBaseUri' for all Wikibase clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239054 (https://phabricator.wikimedia.org/T112737) [09:32:56] (03CR) 10Jcrespo: [C: 031] "+1, consensus verified and patch reflects consensus." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/235900 (https://phabricator.wikimedia.org/T111455) (owner: 10MarcoAurelio) [09:33:37] (03CR) 10Zfilipin: [C: 031] contint: remove obsolete ruby related packages [puppet] - 10https://gerrit.wikimedia.org/r/238436 (owner: 10Hashar) [09:33:42] mafk, I gave you +1, but I would still take into account RelEng (do not know the rules to deploy those kind of changes) [09:33:47] (03CR) 10Zfilipin: [C: 031] contint: remove subversion::client [puppet] - 10https://gerrit.wikimedia.org/r/238442 (owner: 10Hashar) [09:33:53] (03CR) 10Zfilipin: [C: 031] contint: remove postgresql [puppet] - 10https://gerrit.wikimedia.org/r/238438 (owner: 10Hashar) [09:34:08] jynus: thanks [09:36:32] (03PS1) 10Faidon Liambotis: Switch Level3's Dallas recursors to codfw [dns] - 10https://gerrit.wikimedia.org/r/239055 [09:36:34] (03PS1) 10Faidon Liambotis: Put west coast US/Canada traffic back to ulsfo [dns] - 10https://gerrit.wikimedia.org/r/239056 (https://phabricator.wikimedia.org/T98026) [09:37:12] (03PS5) 10Hashar: contint: move some useful packags to a new base class [puppet] - 10https://gerrit.wikimedia.org/r/239044 (https://phabricator.wikimedia.org/T112821) (owner: 10Zfilipin) [09:37:38] (03CR) 10Faidon Liambotis: [C: 032] Switch Level3's Dallas recursors to codfw [dns] - 10https://gerrit.wikimedia.org/r/239055 (owner: 10Faidon Liambotis) [09:37:59] (03CR) 10Faidon Liambotis: [C: 032] Put west coast US/Canada traffic back to ulsfo [dns] - 10https://gerrit.wikimedia.org/r/239056 (https://phabricator.wikimedia.org/T98026) (owner: 10Faidon Liambotis) [09:39:03] !log repooling ulsfo US-West traffic back to ulsfo for the first time since May :) [09:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:39:20] !log elastic in eqiad plugin updates: deleting warmers manually for old unused indices (eswikisource_content_1415240352, ruwiki_content_1415302164, thwiki_content_1415318677). 
We will have to remove these indices. [09:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:41:15] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [09:45:07] 6operations, 10ops-codfw, 5Patch-For-Review: setup/install/deploy new HP restbase servers for codfw - https://phabricator.wikimedia.org/T112683#1648487 (10fgiunchedi) DNS merged, now debugging an issue where `sda` shows up as 1gb drive, possibly a virtual drive from ilo ``` ~ # cat /sys/block/sda/device/mod... [09:46:15] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [09:48:00] !log elastic in eqiad plugin updates: no more groovy in warmers, waiting for few more shards to move in elastic1001 and will unfreeze indices to test warmers [09:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:51:15] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [09:56:15] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [09:58:32] (03CR) 10Aude: "looks sane, though we don't have all of these concept uris to actually resolve to anything on the test wikis." 
(032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239054 (https://phabricator.wikimedia.org/T112737) (owner: 10Hoo man) [10:00:21] !log elastic in eqiad plugin updates: unfreezing indices [10:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:01:15] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [10:04:51] (03CR) 10Hoo man: [C: 032] Set 'repoConceptBaseUri' for all Wikibase clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239054 (https://phabricator.wikimedia.org/T112737) (owner: 10Hoo man) [10:04:58] (03Merged) 10jenkins-bot: Set 'repoConceptBaseUri' for all Wikibase clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239054 (https://phabricator.wikimedia.org/T112737) (owner: 10Hoo man) [10:05:42] !log hoo@tin Synchronized wmf-config/: Set 'repoConceptBaseUri' for all Wikibase clients (duration: 00m 13s) [10:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:06:15] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [10:06:44] (03PS1) 10Muehlenhoff: Enable ferm on mw1240 - mw1249 [puppet] - 10https://gerrit.wikimedia.org/r/239062 [10:06:46] (03PS1) 10Muehlenhoff: Enable ferm on mw1250 - mw1258 [puppet] - 10https://gerrit.wikimedia.org/r/239063 [10:07:50] <_joe_> dcausse: do you have any idea in which file should I find the error line reported in https://phabricator.wikimedia.org/T112834 ? [10:08:58] _joe_: no... 
I tried, this is something I searched in /a/mw-log/CirrusSearch.log but I didn't see anything :/ [10:09:25] <_joe_> I hate that we don't log all errors to the local disk too [10:09:34] yep [10:10:08] <_joe_> that is from exception-json btw [10:11:15] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [10:11:58] !log elastic in eqiad plugin updates: restarting elastic1002 [10:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:12:48] !log depooled mw1240-mw1249 (T104968) [10:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:15:23] (03PS2) 10Muehlenhoff: Enable ferm on mw1240 - mw1249 [puppet] - 10https://gerrit.wikimedia.org/r/239062 [10:15:41] (03PS1) 10Giuseppe Lavagetto: poolcounter: use codfw hosts in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239064 [10:15:49] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on mw1240 - mw1249 [puppet] - 10https://gerrit.wikimedia.org/r/239062 (owner: 10Muehlenhoff) [10:16:15] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [10:16:47] (03CR) 10Giuseppe Lavagetto: [C: 032] poolcounter: use codfw hosts in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239064 (owner: 10Giuseppe Lavagetto) [10:18:38] !log oblivian@tin Synchronized wmf-config/PoolCounterSettings-codfw.php: Use codfw poolcounters in codfw (duration: 00m 12s) [10:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:19:58] <_joe_> !log experimenting with poolcounter issues on subra [10:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:21:15] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [10:24:51] !log repooled mw1240-mw1249 (T104968) [10:24:58] <_joe_> !log killing temporarily subra [10:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:25:05] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:26:15] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [10:27:32] !log depooled mw1250-mw1258 (T104968) [10:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:29:44] (03PS2) 10Muehlenhoff: Enable ferm on mw1250 - mw1258 [puppet] - 10https://gerrit.wikimedia.org/r/239063 [10:30:37] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on mw1250 - mw1258 [puppet] - 10https://gerrit.wikimedia.org/r/239063 (owner: 10Muehlenhoff) [10:31:06] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [10:35:18] !log repooled mw1250-mw1258 (T104968) [10:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:36:13] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [10:39:45] (03PS1) 10Aklapper: Make Phabricator's license footer cover uploads / "content" [puppet] - 10https://gerrit.wikimedia.org/r/239067 [10:40:37] (03CR) 10JanZerebecki: "Please do." [puppet] - 10https://gerrit.wikimedia.org/r/230483 (https://phabricator.wikimedia.org/T97195) (owner: 10Smalyshev) [10:41:13] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [10:45:24] 6operations, 10Wikimedia-Git-or-Gerrit, 5Patch-For-Review: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1648597 (10JanZerebecki) @Paladox Does it now work for you without the workaround? 
[10:46:13] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [10:47:47] 6operations, 10Wikimedia-Extension-setup, 7Database, 5Patch-For-Review: Enable wikilove on outreachwiki - https://phabricator.wikimedia.org/T106264#1463343 (10Steinsplitter) [10:47:59] 6operations, 10Wikimedia-Extension-setup, 7Database, 5Patch-For-Review: Enable wikilove on outreachwiki - https://phabricator.wikimedia.org/T106264#1463343 (10Steinsplitter) [10:50:04] RECOVERY - puppet last run on analytics1029 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [10:51:13] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [10:52:53] (03Abandoned) 10Aklapper: Make Phabricator's license footer cover uploads / "content" [puppet] - 10https://gerrit.wikimedia.org/r/239067 (owner: 10Aklapper) [10:54:00] backup4001 issue was me, indirectly [10:54:41] (03CR) 10Steinsplitter: Make Phabricator's license footer cover uploads / "content" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/239067 (owner: 10Aklapper) [10:56:13] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [10:58:12] (03PS1) 10Faidon Liambotis: Handle our own networks better [dns] - 10https://gerrit.wikimedia.org/r/239069 [10:58:14] (03PS1) 10Faidon Liambotis: Add codfw everywhere on the map [dns] - 10https://gerrit.wikimedia.org/r/239070 [10:58:16] (03PS1) 10Faidon Liambotis: Switch Middle-East's backup from ulsfo to eqiad [dns] - 10https://gerrit.wikimedia.org/r/239071 [10:58:18] (03PS1) 10Faidon Liambotis: Switch Central/South Asia to esams [dns] - 10https://gerrit.wikimedia.org/r/239072 [10:58:45] damn [10:59:27] forgot to copy the change-id [10:59:31] (03Abandoned) 10Faidon Liambotis: Switch Central/South Asia to esams [dns] - 10https://gerrit.wikimedia.org/r/80973 (owner: 10Faidon Liambotis) [10:59:35] oh well [11:00:49] (03PS1) 10Aklapper: Make Phabricator's license footer cover uploads / "content" [puppet] - 
10https://gerrit.wikimedia.org/r/239073 [11:01:09] RECOVERY - check_puppetrun on backup4001 is OK: OK: Puppet is currently enabled, last run 227 seconds ago with 0 failures [11:01:39] (03PS1) 10Muehlenhoff: Enable ferm on mw1030-mw1039 [puppet] - 10https://gerrit.wikimedia.org/r/239074 [11:01:41] (03PS1) 10Muehlenhoff: Enable ferm on mw1040-mw1049 [puppet] - 10https://gerrit.wikimedia.org/r/239075 [11:01:43] (03PS1) 10Muehlenhoff: Enable for mw1050-mw1059 [puppet] - 10https://gerrit.wikimedia.org/r/239076 [11:01:45] (03PS1) 10Muehlenhoff: Enable ferm on mw1060-mw1069 [puppet] - 10https://gerrit.wikimedia.org/r/239077 [11:03:52] !log depooled mw1030 and mw1032-mw1239 (T104968) [11:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:04:16] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on mw1030-mw1039 [puppet] - 10https://gerrit.wikimedia.org/r/239074 (owner: 10Muehlenhoff) [11:18:00] !log repooled mw1030 and mw1032-mw1239 (T104968) [11:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:23:19] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.69% of data above the critical threshold [500.0] [11:26:01] !log depooled mw1040 and mw1042-mw1049 (T104968) [11:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:26:46] !log typoed earlier entry: "mw1032-mw1039" instead of "mw1032-mw1239" [11:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:28:55] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on mw1040-mw1049 [puppet] - 10https://gerrit.wikimedia.org/r/239075 (owner: 10Muehlenhoff) [11:28:59] (03PS1) 10Giuseppe Lavagetto: videoscaler: reimage tmh1001 as mw1259 [puppet] - 10https://gerrit.wikimedia.org/r/239078 (https://phabricator.wikimedia.org/T104747) [11:33:13] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:36:15] !log 
elastic in eqiad plugin updates: restarting elastic1003 [11:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:39:06] !log repooled mw1040 and mw1042-mw1049 (T104968) [11:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:48:54] RECOVERY - Hadoop DataNode on analytics1029 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [11:53:44] 6operations, 10Mathoid, 10RESTBase, 6Services: Document and hook up public mathoid end point in RB - https://phabricator.wikimedia.org/T102030#1648703 (10mobrovac) >>! In T102030#1518394, @mobrovac wrote: > [PR #303](https://github.com/wikimedia/restbase/pull/303) adds the Mathoid public API as per T103811... [11:54:13] PROBLEM - Hadoop DataNode on analytics1029 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [12:04:06] !log depooled mw1050-mw1059 [12:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:06:08] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable for mw1050-mw1059 [puppet] - 10https://gerrit.wikimedia.org/r/239076 (owner: 10Muehlenhoff) [12:06:45] (03CR) 10Alexandros Kosiaris: [C: 031] "puppet does not enforce a style guide on the ruby functions. I think this rule is fine" [puppet] - 10https://gerrit.wikimedia.org/r/238779 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [12:08:46] (03CR) 10Alexandros Kosiaris: [C: 031] "Heh, never really like that notation, too perl-y IMHO. I am fine with it though if everyone else is as well" [puppet] - 10https://gerrit.wikimedia.org/r/238778 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [12:10:22] (03CR) 10Alexandros Kosiaris: [C: 031] "Does this emit an error ? 
If not, and given submodules are not cloned, I see no reason to exclude them explicitly" [puppet] - 10https://gerrit.wikimedia.org/r/238471 (https://phabricator.wikimedia.org/T102020) (owner: 10JanZerebecki) [12:16:48] !log repooled mw1050-mw1059 [12:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:24:21] !log repooled mw1060 and mw1062-mw1069 (T104968) [12:24:22] There's a request to kill one of the gwtoolset jobs at commons - https://lists.wikimedia.org/pipermail/glamtools/2015-September/000520.html [12:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:24:39] !log depooled mw1060 and mw1062-mw1069 (T104968) (not repooled) [12:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:24:50] PROBLEM - puppet last run on mw2008 is CRITICAL: CRITICAL: puppet fail [12:25:37] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on mw1060-mw1069 [puppet] - 10https://gerrit.wikimedia.org/r/239077 (owner: 10Muehlenhoff) [12:26:30] (03CR) 10Zfilipin: [C: 031] "There are no error messages." [puppet] - 10https://gerrit.wikimedia.org/r/238471 (https://phabricator.wikimedia.org/T102020) (owner: 10JanZerebecki) [12:30:20] (03CR) 10JanZerebecki: "Yes." 
[puppet] - 10https://gerrit.wikimedia.org/r/238471 (https://phabricator.wikimedia.org/T102020) (owner: 10JanZerebecki) [12:35:52] !log repooled mw1060 and mw1062-mw1069 (T104968) [12:45:12] (03PS1) 10Muehlenhoff: Enable ferm on mw1070-mw1079 [puppet] - 10https://gerrit.wikimedia.org/r/239086 [12:45:14] (03PS1) 10Muehlenhoff: Enable ferm on mw1080 - mw1089 [puppet] - 10https://gerrit.wikimedia.org/r/239087 [12:45:16] (03PS1) 10Muehlenhoff: Enable ferm on mw1090 - mw1099 [puppet] - 10https://gerrit.wikimedia.org/r/239088 [12:49:47] RECOVERY - Hadoop DataNode on analytics1029 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [12:49:52] !log depooled mw1070-mw1079 (T104968) [12:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:51:40] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on mw1070-mw1079 [puppet] - 10https://gerrit.wikimedia.org/r/239086 (owner: 10Muehlenhoff) [12:53:36] RECOVERY - puppet last run on mw2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:55:06] PROBLEM - Hadoop DataNode on analytics1029 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [13:01:21] 6operations, 10ops-codfw, 5Patch-For-Review: setup/install/deploy new HP restbase servers for codfw - https://phabricator.wikimedia.org/T112683#1648844 (10fgiunchedi) the magic bios options to exclude the virtual media can be found under ``` System Configuration -> BIOS/Platform Configuration (RBSU) -> Sys... [13:01:41] !log repooled mw1070-mw1079 (T104968) [13:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:05:00] 6operations, 10Beta-Cluster, 7Performance: Need a way to simulate replication lag to test replag issues - https://phabricator.wikimedia.org/T40945#1648855 (10Reedy) T59583 is a dupe? 
[13:05:10] !log depooled mw1080-mw1089 (T104968) [13:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:07:31] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on mw1080 - mw1089 [puppet] - 10https://gerrit.wikimedia.org/r/239087 (owner: 10Muehlenhoff) [13:07:39] 6operations: operations/software/conftool fails tox-py27-jessie - https://phabricator.wikimedia.org/T112853#1648867 (10Reedy) [13:13:08] !log repooled mw1080-mw1089 (T104968) [13:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:16:46] !log depooled mw1090-mw1099 (T104968) [13:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:17:42] 6operations, 10Traffic, 10fundraising-tech-ops, 7IPv6, 5Patch-For-Review: Enable IPv6 on donate.wikimedia.org - https://phabricator.wikimedia.org/T73267#1648879 (10BBlack) Are we still good to go for this? I had planned to push this change Tues or Weds, but finally have some time to come back around to i... [13:18:02] (03PS2) 10BBlack: Remove wikidata CA cookie hacks [puppet] - 10https://gerrit.wikimedia.org/r/238418 (https://phabricator.wikimedia.org/T109072) [13:18:18] (03CR) 10BBlack: [C: 032 V: 032] Remove wikidata CA cookie hacks [puppet] - 10https://gerrit.wikimedia.org/r/238418 (https://phabricator.wikimedia.org/T109072) (owner: 10BBlack) [13:20:55] (03PS2) 10Muehlenhoff: Enable ferm on mw1090 - mw1099 [puppet] - 10https://gerrit.wikimedia.org/r/239088 [13:21:17] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on mw1090 - mw1099 [puppet] - 10https://gerrit.wikimedia.org/r/239088 (owner: 10Muehlenhoff) [13:24:53] 6operations, 10Traffic, 10fundraising-tech-ops, 7IPv6, 5Patch-For-Review: Enable IPv6 on donate.wikimedia.org - https://phabricator.wikimedia.org/T73267#1648900 (10jrobell) @BBlack thanks for the update. There was an email sent out to Italy donors this morning that might be affected by this deployment.... 
[13:26:07] 6operations, 6Commons: delete gwtoolset job by Hansmuller - https://phabricator.wikimedia.org/T112878#1648901 (10Bawolff) 3NEW [13:26:10] !log repooled mw1090-mw1099 (T104968) [13:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:30:24] 6operations, 10Beta-Cluster, 7Performance: Need a way to simulate replication lag to test replag issues - https://phabricator.wikimedia.org/T40945#1648924 (10Nemo_bis) >>! In T40945#1648855, @Reedy wrote: > T59583 is a dupe? That's a proposed solution, AFAIK. [13:30:31] 6operations, 10Beta-Cluster, 7Performance: Need a way to simulate replication lag to test replag issues - https://phabricator.wikimedia.org/T40945#1648925 (10Nemo_bis) [13:31:11] (03PS1) 10Muehlenhoff: Enable ferm on mw1236 - mw1239 [puppet] - 10https://gerrit.wikimedia.org/r/239093 [13:34:55] !log depooled mw1236-mw1239 (T104968) [13:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:40:03] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on mw1236 - mw1239 [puppet] - 10https://gerrit.wikimedia.org/r/239093 (owner: 10Muehlenhoff) [13:42:29] (03PS1) 10Reedy: Beta cluster "test.wikipedia" thinks it is "test.wikimedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239097 (https://phabricator.wikimedia.org/T99156) [13:44:43] !next [13:45:14] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 647 [13:45:14] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 649 [13:46:01] 6operations, 6Phabricator, 6Project-Creators, 6Triagers: Broaden the group of users that can create projects in Phabricator - https://phabricator.wikimedia.org/T706#1648988 (10MarcoAurelio) Hi. In order to finish converting {T43492} into a project I'd like to be granted the ability to create and manage pro... 
[13:50:14] RECOVERY - check_mysql on db1008 is OK: Uptime: 4741340 Threads: 1 Questions: 32135183 Slow queries: 32087 Opens: 77738 Flush tables: 2 Open tables: 64 Queries per second avg: 6.777 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [13:50:14] RECOVERY - check_mysql on lutetium is OK: Uptime: 2506466 Threads: 2 Questions: 17914469 Slow queries: 8546 Opens: 38476 Flush tables: 2 Open tables: 64 Queries per second avg: 7.147 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [13:50:36] (03CR) 10Anomie: [C: 031] Enable authmetrics logging everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238978 (https://phabricator.wikimedia.org/T91701) (owner: 10Gergő Tisza) [13:50:37] !log repooled mw1236-mw1239 (T104968) [13:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:51:36] (03CR) 10Luke081515: [C: 04-1] "Please look first at this, before merge:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239097 (https://phabricator.wikimedia.org/T99156) (owner: 10Reedy) [13:53:19] (03CR) 10Reedy: "What about it?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239097 (https://phabricator.wikimedia.org/T99156) (owner: 10Reedy) [13:54:15] PROBLEM - puppet last run on analytics1029 is CRITICAL: CRITICAL: Puppet has 1 failures [13:54:50] (03PS1) 10Giuseppe Lavagetto: Rename tmh1001 to mw1259 for reimaging [dns] - 10https://gerrit.wikimedia.org/r/239103 (https://phabricator.wikimedia.org/T104747) [13:55:04] (03CR) 10Reedy: "It should be test.wikimedia not test.wikipedia? Why hasn't the task been fixed then? ;)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239097 (https://phabricator.wikimedia.org/T99156) (owner: 10Reedy) [13:56:48] (03CR) 10Reedy: "But that's a different issue anyway. Both production and beta are at the test.wikipedia. 
So this at least makes things work correctly unti" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239097 (https://phabricator.wikimedia.org/T99156) (owner: 10Reedy) [13:58:13] !log elastic in eqiad plugin updates: restarting elastic1004 [13:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:59:29] (03CR) 10Reedy: "I can't actually see an issue in phab about renaming it either..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239097 (https://phabricator.wikimedia.org/T99156) (owner: 10Reedy) [14:00:53] <_joe_> Reedy: ^^ it's puppetswat material I guess? [14:01:07] My patch? [14:01:13] <_joe_> yep [14:01:15] It can just be a normal swat patch as it's MW config stuff [14:01:28] <_joe_> oh it's just mw-config [14:01:31] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [100000000.0] [14:01:32] heh, yeah [14:01:35] <_joe_> I thought it was mw-config and puppet [14:01:38] <_joe_> I misread [14:01:49] renaming the wiki would be puppet/apache config stuff [14:01:55] <_joe_> I was basically enjoing the comment war [14:01:55] but this is just fixing misconfiguration [14:02:02] <_joe_> ok [14:02:39] As I said to him (and I know you're not disagreeing)... It doesn't matter if it's wanting to be renamed, this is fixing the issue at hand. If it gets "reverted" at a later date as part of a rename, that's fine [14:05:36] !log elastic in eqiad plugin updates: can't restart elastic1004 (2 timeouts when disabling replication, too much load?), waiting for more shards to rebalance... 
[14:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:05:52] (03CR) 10Giuseppe Lavagetto: [C: 032] Rename tmh1001 to mw1259 for reimaging [dns] - 10https://gerrit.wikimedia.org/r/239103 (https://phabricator.wikimedia.org/T104747) (owner: 10Giuseppe Lavagetto) [14:06:46] (03PS2) 10Giuseppe Lavagetto: videoscaler: reimage tmh1001 as mw1259 [puppet] - 10https://gerrit.wikimedia.org/r/239078 (https://phabricator.wikimedia.org/T104747) [14:07:06] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] videoscaler: reimage tmh1001 as mw1259 [puppet] - 10https://gerrit.wikimedia.org/r/239078 (https://phabricator.wikimedia.org/T104747) (owner: 10Giuseppe Lavagetto) [14:08:36] 6operations, 5Patch-For-Review: Change distribution in releases.wikimedia.org to "sid" or "jessie" - https://phabricator.wikimedia.org/T111225#1649104 (10Dzahn) Nice! I'll call it resolved once we see jessie on http://releases.wikimedia.org/debian/dists/ . Not entirely sure but i hope reprepro wil just create... [14:11:13] !log stopping replication and applying schema change to db1051 [14:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:11:50] What's the bot command for nextdeploy window? [14:12:09] PROBLEM - Host tmh1001 is DOWN: PING CRITICAL - Packet loss = 100% [14:12:10] !next [14:12:19] jouncebot: next [14:12:19] In 0 hour(s) and 47 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150917T1500) [14:13:21] I might just deploy [14:13:28] I might not be around for SWAT [14:13:29] xD [14:14:01] <_joe_> !log reimaging tmh1001 to mw1259 [14:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:14:10] RECOVERY - Host tmh1001 is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms [14:14:12] ah, thanks for that log line [14:16:54] <_joe_> mutante: I was actually a tad late [14:17:19] i already started typing "i'll look at ..." 
[14:18:00] PROBLEM - configured eth on tmh1001 is CRITICAL: Connection refused by host [14:18:10] PROBLEM - dhclient process on tmh1001 is CRITICAL: Connection refused by host [14:18:21] PROBLEM - nutcracker port on tmh1001 is CRITICAL: Connection refused by host [14:18:29] PROBLEM - nutcracker process on tmh1001 is CRITICAL: Connection refused by host [14:18:31] PROBLEM - salt-minion processes on tmh1001 is CRITICAL: Connection refused by host [14:18:40] PROBLEM - puppet last run on tmh1001 is CRITICAL: Connection refused by host [14:18:50] PROBLEM - RAID on tmh1001 is CRITICAL: Connection refused by host [14:18:51] PROBLEM - DPKG on tmh1001 is CRITICAL: Connection refused by host [14:19:14] PROBLEM - Disk space on tmh1001 is CRITICAL: Connection refused by host [14:20:10] !log starting hadoop datanode on analytics1029 [14:20:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:20:33] ottomata: ^ i just replaced "namenode" with "datanode" and it worked :p [14:21:46] does this need anything else? [14:22:11] eh, Failed to start Hadoop datanode. Return value: 1 [14:22:50] !log analytics1029 - Failed to start Hadoop datanode [14:22:50] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [14:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:25:28] 6operations, 10ops-eqiad, 10Traffic, 10netops, 5Patch-For-Review: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1649116 (10BBlack) Yeah the "other 4 row B ports" I'm talking about above would be for high-traffic1 + high-traffic2 (lvs2007, 8, 10, 11 - equiv to lvs2001, 2, 4... [14:25:28] mutante: no datanodes are pretty easy, but why was it down, that is the question!? [14:25:59] i saw the puppet not run alert a few mins ago [14:26:05] is this more network weirdness?
[14:26:45] hm, datanode start not running, interesting [14:26:47] looking [14:27:06] ottomata: yes, it did not start, it failed [14:27:12] sorry, on a bus, lag [14:27:32] ottomata: replied on list [14:28:20] EXT4-fs error [14:28:23] in syslog [14:28:26] hm Caught exception while scanning /var/lib/hadoop/data/h/hdfs/dn/current. Will throw later. [14:28:26] ExitCodeException exitCode=1: du: cannot access ‘/var/lib/hadoop/data/h/hdfs/dn/current/BP-1552854784-10.64.21.110-1405114489661/current/finalized/subdir38/subdir71/blk_1109804877_36066128.meta’: Input/output error [14:28:26] du: cannot access ‘/var/lib/hadoop/data/h/hdfs/dn/current/BP-1552854784-10.64.21.110-1405114489661/current/finalized/subdir38/subdir71/blk_1109804877’: Input/output error [14:28:39] yea hm [14:29:08] kernel: [58589.417487] EXT4-fs error (device sdh1): [14:29:12] mutante: am umounting and fscking [14:29:13] sdh1 broken? [14:29:24] ok, cool [14:32:51] (03PS2) 10Chad: Simplify logging in ssh module [tools/scap] - 10https://gerrit.wikimedia.org/r/238959 [14:32:54] (03PS4) 10Chad: A context manager for managing nested loggers [tools/scap] - 10https://gerrit.wikimedia.org/r/239028 (owner: 1020after4) [14:33:11] (03CR) 10jenkins-bot: [V: 04-1] Simplify logging in ssh module [tools/scap] - 10https://gerrit.wikimedia.org/r/238959 (owner: 10Chad) [14:33:16] RECOVERY - Hadoop DataNode on analytics1029 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [14:33:26] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 16.67% of data above the critical threshold [500.0] [14:34:24] (03CR) 10Luke081515: [C: 031] Beta cluster "test.wikipedia" thinks it is "test.wikimedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239097 (https://phabricator.wikimedia.org/T99156) (owner: 10Reedy) [14:34:33] lol.
[14:36:14] (03CR) 10Alexandros Kosiaris: [C: 032] Also check submodules with rubocop [puppet] - 10https://gerrit.wikimedia.org/r/238471 (https://phabricator.wikimedia.org/T102020) (owner: 10JanZerebecki) [14:36:16] (03CR) 10Chad: "I think we don't want PS4 here, accidentally'd some changes. Restoring PS3." [tools/scap] - 10https://gerrit.wikimedia.org/r/239028 (owner: 1020after4) [14:36:20] (03PS3) 10Alexandros Kosiaris: Also check submodules with rubocop [puppet] - 10https://gerrit.wikimedia.org/r/238471 (https://phabricator.wikimedia.org/T102020) (owner: 10JanZerebecki) [14:37:08] (03PS3) 10Chad: Simplify logging in ssh module [tools/scap] - 10https://gerrit.wikimedia.org/r/238959 [14:37:11] (03PS5) 10Chad: A context manager for managing nested loggers [tools/scap] - 10https://gerrit.wikimedia.org/r/239028 (owner: 1020after4) [14:39:02] (03PS1) 10Giuseppe Lavagetto: videoscaler: fix typo in mw1259 node definition [puppet] - 10https://gerrit.wikimedia.org/r/239110 [14:39:17] akosiaris: do you know what's up with mendelevium? i know it's new and not used yet for OTRS, but no SSH while ping works? [14:39:23] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] videoscaler: fix typo in mw1259 node definition [puppet] - 10https://gerrit.wikimedia.org/r/239110 (owner: 10Giuseppe Lavagetto) [14:39:41] (03PS4) 10Chad: Simplify logging in ssh module [tools/scap] - 10https://gerrit.wikimedia.org/r/238959 [14:40:46] goes to ganeti console [14:41:27] and sees a login there.. so far so good [14:43:32] got on it, but oh oh, it's oom killing things [14:43:46] mutante: ok that explains it [14:43:53] I wonder why though [14:43:54] Login timed out after 60 seconds.
[14:44:03] kicked :p kind of [14:44:16] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:44:25] [85759.551122] Out of memory: Kill process 3864 (puppet) score 6 or sacrifice child [14:45:43] (03PS3) 10Rush: elastic: define codfw lvs [puppet] - 10https://gerrit.wikimedia.org/r/238507 [14:46:11] (03CR) 10JanZerebecki: [C: 031] Make gerrit offer newer key exchange algorithms for new sshs [puppet] - 10https://gerrit.wikimedia.org/r/237753 (https://phabricator.wikimedia.org/T112025) (owner: 10QChris) [14:46:22] (03CR) 10JanZerebecki: [C: 031] Ensure gerrit's plugins are kept in sync with plugin repo [puppet] - 10https://gerrit.wikimedia.org/r/238976 (owner: 10QChris) [14:47:26] mutante: loadavg 263 ? [14:47:32] wat ? [14:47:45] grmbl. i couldn't get back on yet [14:47:52] and my own connection is crappy too [14:48:19] (03CR) 10Rush: [C: 032] elastic: define codfw lvs [puppet] - 10https://gerrit.wikimedia.org/r/238507 (owner: 10Rush) [14:48:34] http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&h=mendelevium.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [14:49:09] (03PS1) 10Chad: Convert tasks.* to use context logger [tools/scap] - 10https://gerrit.wikimedia.org/r/239112 [14:49:10] constantly in iowait ? [14:49:56] I think I am starting to understand why [14:49:57] damn, i wonder why now. was there anything on it already or just a fresh VM and nothing else [14:50:13] this is still running /bin/sh /etc/cron.daily/spamassassin [14:50:28] it's a fresh VM with the otrs role applied [14:50:35] RECOVERY - puppet last run on analytics1029 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:50:41] gotcha, so including the mail stuff [14:52:04] bblack, can you purge everything from the varnish maps cluster? We messed up yesterday right before sending out a mass email, so had to delay that. Everything is fixed, but caches are still showing broken stuff. Thanks!
[14:52:44] yurik: yeah will do now [14:52:49] thanks [14:53:11] bblack, as we have fairly low traffic for now, how can we do it so we don't have to bug you each time? [14:53:27] e.g. mass purge, not individual tiles [14:53:54] akosiaris: i can ssh to it normal again and i see load is totally down. you did that, right [14:54:05] mutante: no I did not [14:54:18] I killed only the spamassasin cron [14:54:22] but that was about it [14:54:39] the part that got my attention was actually just that SSH was shown as status "Server answer:" [14:54:58] yurik: should be fully-purged now [14:55:42] thx! [14:56:06] yurik: generally speaking, I don't want you to have any easy way to mass purge. because then it becomes a crutch, and it's not one we want to use when this is a heavier production service. That's why we don't offer it for the other caches either :P [14:56:29] we need to fix other things (some technical, some about planning and such) to make sure that's not generally necessary [14:57:54] bblack, i understand about fully prod system, but for now we run into errors all the time, and it takes a long time to flush it. i could of course automate the flushing via the UDP interface (btw, filed a bug for that), but it will happen a lot while we sort out the proper map styling [14:59:04] sounds to me like we don't want caching at all right now [14:59:08] (I asked yesterday but no answer yet so will do again) it looks like sites such as gerrit, mailman and ganglia do not load (or very, very slow) for me. Gerrit is stuck at "Loading Gerrit Code Review...", loading https://lists.wikimedia.org/mailman/listinfo takes 7200ms and https://lists.wikimedia.org/images/mailman/mailman.jpg takes more than 6100ms [14:59:11] yurik: you might want to introduce a cache buster bit in your url if you need to atomically update things [14:59:15] <_joe_> akosiaris: touch your nose [14:59:34] Sites such as the English Wikipedia load normally for me. Is there someone here who experiences this too? 
[14:59:41] (03PS2) 10Faidon Liambotis: Switch Middle-East's backup from ulsfo to eqiad [dns] - 10https://gerrit.wikimedia.org/r/239071 [14:59:43] (03PS2) 10Faidon Liambotis: Add codfw everywhere on the map [dns] - 10https://gerrit.wikimedia.org/r/239070 [14:59:45] (03PS2) 10Faidon Liambotis: Geolocate our networks to their respective DC [dns] - 10https://gerrit.wikimedia.org/r/239069 [14:59:45] yurik: a version number, essentially [14:59:47] (03PS2) 10Faidon Liambotis: Switch Central/South Asia to esams [dns] - 10https://gerrit.wikimedia.org/r/239072 [14:59:49] (03PS1) 10Faidon Liambotis: Move codfw to be second in place at the DC list [dns] - 10https://gerrit.wikimedia.org/r/239114 [14:59:57] <_joe_> akosiaris: I was about to say the same :) [15:00:05] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150917T1500). Please do the needful. [15:00:38] _joe_: hehe [15:01:00] (03CR) 1020after4: [C: 031] Convert tasks.* to use context logger (031 comment) [tools/scap] - 10https://gerrit.wikimedia.org/r/239112 (owner: 10Chad) [15:01:28] RECOVERY - Disk space on labstore1002 is OK: DISK OK [15:01:52] I can SWAT this morning: Nikerabbit ping for SWAT [15:02:01] thcipriani: good evening [15:02:17] okie doke, merging [15:03:24] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238097 (https://phabricator.wikimedia.org/T111901) (owner: 10KartikMistry) [15:03:44] (03CR) 10Chad: Convert tasks.* to use context logger (031 comment) [tools/scap] - 10https://gerrit.wikimedia.org/r/239112 (owner: 10Chad) [15:03:44] mutante: I 'll be monitoring to see what triggered this [15:03:47] (03Merged) 10jenkins-bot: CX: Enable Suggestions in ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238097 (https://phabricator.wikimedia.org/T111901) (owner: 10KartikMistry) [15:04:15] I 've got nothing right now though 
[15:04:20] I experience this slowness since yesterday. I thought it might be some problem with misc-web-lb, but some other sites which are served by misc-web-lb too, load normally so.. [15:04:34] (03CR) 1020after4: [C: 031] Simplify logging in ssh module (031 comment) [tools/scap] - 10https://gerrit.wikimedia.org/r/238959 (owner: 10Chad) [15:04:43] SPF|Cloud: mailman, ganglia, gerrit are not behind misc-web-lb [15:04:49] PROBLEM - configured eth on mw1259 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:05:00] PROBLEM - dhclient process on mw1259 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:05:03] oh wait, I'm sleeping, nvm [15:05:09] sorry :) [15:05:19] PROBLEM - mediawiki-installation DSH group on mw1259 is CRITICAL: Host mw1259 is not in mediawiki-installation dsh group [15:05:25] gwicke, i wish i could introduce a version - problem is that we don't control the UI - we simply give a tile URL, and users build their own labs instances [15:05:27] no worries [15:05:29] you're right. but I don't know why this happens though [15:05:31] (03PS5) 10Chad: Simplify logging in ssh module [tools/scap] - 10https://gerrit.wikimedia.org/r/238959 [15:05:39] PROBLEM - nutcracker port on mw1259 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:05:51] SPF|Cloud: only those 3 sites ? not wikipedia.org ? [15:05:57] yes [15:05:59] PROBLEM - nutcracker process on mw1259 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:06:08] and wikitechwiki. production wikis load normally [15:06:08] PROBLEM - puppet last run on mw1259 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:06:09] PROBLEM - salt-minion processes on mw1259 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[15:06:15] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: CX: Enable Suggestions in ptwiki [[gerrit:238097]] (duration: 00m 13s) [15:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:06:23] ^ Nikerabbit check please [15:06:24] (03CR) 10Chad: [C: 031] "Working for me, was able to make 2 child patches that worked off of this. Want a 2nd set of eyes first before merging :)" [tools/scap] - 10https://gerrit.wikimedia.org/r/239028 (owner: 1020after4) [15:06:38] (03PS2) 10Andrew Bogott: Add check for /public/dumps [puppet] - 10https://gerrit.wikimedia.org/r/238960 (https://phabricator.wikimedia.org/T97748) [15:06:40] (03PS3) 10Andrew Bogott: Added tests for grid job submission. [puppet] - 10https://gerrit.wikimedia.org/r/238863 (https://phabricator.wikimedia.org/T97748) [15:06:49] PROBLEM - DPKG on mw1259 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:06:59] PROBLEM - Disk space on mw1259 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:07:22] SPF|Cloud: I haven't heard anything similar these days. It's puzzling that you don't experience it on other wikimedia sites but you do on those 3 [15:07:29] PROBLEM - RAID on mw1259 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:07:40] actually 4 since it's wikitechwiki too [15:08:13] I can try a tracert if you want [15:08:27] (03CR) 10Filippo Giunchedi: [C: 031] "we could try it, IIRC one of the reasons this was disabled is that webproxy is fairly high traffic (also carbon is raid5, sigh) so keep an" [puppet] - 10https://gerrit.wikimedia.org/r/239009 (https://phabricator.wikimedia.org/T97119) (owner: 10Dzahn) [15:08:36] thcipriani: ahh, I just realised that the code that uses this variables only goes out tonight with the train, but all existing things work for now [15:08:48] Nikerabbit: kk [15:09:18] SPF|Cloud: logged-in or not in project sites ? 
trying to figure out if it's the caching that's hiding the problem for other sites [15:09:20] thcipriani: works wonderfully on testwiki ;) [15:09:24] anyone know anything about tmh1001.eqiad.wmnet ? ssh could not resolve hostname, can't ping it. [15:09:40] I'm not logged in [15:10:30] (03CR) 10Andrew Bogott: [C: 032] Add check for /public/dumps [puppet] - 10https://gerrit.wikimedia.org/r/238960 (https://phabricator.wikimedia.org/T97748) (owner: 10Andrew Bogott) [15:10:38] RECOVERY - DPKG on mw1259 is OK: All packages OK [15:10:39] RECOVERY - dhclient process on mw1259 is OK: PROCS OK: 0 processes with command name dhclient [15:10:49] RECOVERY - Disk space on mw1259 is OK: DISK OK [15:11:00] PROBLEM - Host heka is DOWN: PING CRITICAL - Packet loss = 100% [15:11:00] SPF|Cloud: ok that explains why you don't see the problem in wikipedia.org and the like [15:11:19] RECOVERY - nutcracker port on mw1259 is OK: TCP OK - 0.000 second response time on port 11212 [15:11:19] RECOVERY - RAID on mw1259 is OK: OK [15:11:38] RECOVERY - nutcracker process on mw1259 is OK: PROCS OK: 1 process with UID = 109 (nutcracker), command name nutcracker [15:11:48] RECOVERY - salt-minion processes on mw1259 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:12:18] RECOVERY - configured eth on mw1259 is OK: OK - interfaces up [15:13:52] ah, SAL says _joe_ is reimaging tmh1001 so I guess the fact that scap fails to sync it is fine [15:14:15] thcipriani: I was about to say so [15:14:44] akosiaris: cool, just wanted to make sure, thanks :) [15:15:16] SPF|Cloud: sounds networking related, wanna file a task ? [15:15:19] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail [15:15:30] sure. 
will attach a traceroute [15:15:36] !log elastic in eqiad plugin updates: restarting elastic1004 (take 2) [15:15:38] 6operations, 10ops-codfw, 5Patch-For-Review: setup/install/deploy new HP restbase servers for codfw - https://phabricator.wikimedia.org/T112683#1649353 (10fgiunchedi) all new machines have been provisioned and OS installed, keys signed / puppet /etc [15:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:15:43] 6operations, 10ops-codfw, 5Patch-For-Review: setup/install/deploy new HP restbase servers for codfw - https://phabricator.wikimedia.org/T112683#1649355 (10fgiunchedi) 5Open>3Resolved [15:17:09] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [15:17:39] PROBLEM - check_puppetrun on bellatrix is CRITICAL: CRITICAL: Puppet has 47 failures [15:18:34] mutante: for some reason mendelevium is starting to have a linear load increase since 21:30 approximately yesterday [15:19:09] number of processes went up, load went up, iowait as well, but not memory [15:19:15] which given the OOM sounds strange [15:19:47] (03CR) 10Alex Monk: [C: 04-1] "There are other things that expect this to be test.wikimedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239097 (https://phabricator.wikimedia.org/T99156) (owner: 10Reedy) [15:20:09] attn: the frack/codfw puppetstorm is me, fixing... [15:22:38] PROBLEM - check_puppetrun on bellatrix is CRITICAL: CRITICAL: Puppet has 47 failures [15:23:35] 6operations: videoscaler naming conventions - https://phabricator.wikimedia.org/T105009#1649389 (10Krenair) @Joe did https://gerrit.wikimedia.org/r/#/c/239103/ to rename tmh1001 -> mw1259 [15:25:22] akosiaris: but..you fixed it :) it's all idle again [15:26:37] mutante: again.. not me [15:26:45] it auto fixed itself [15:26:52] which is fine ... self healing !!! 
[15:27:06] (03CR) 1020after4: "I think we should convert it to use scap.log, but in that case maybe we should convert all the imports to use the same form for consistenc" [tools/scap] - 10https://gerrit.wikimedia.org/r/239112 (owner: 10Chad) [15:27:06] yes:) ok [15:27:06] box came back from the dead on its own [15:28:22] akosiaris: https://en.wiktionary.org/wiki/Scheintod#Noun :) [15:28:25] (03CR) 10Thcipriani: [C: 04-1] "Dan and I talked yesterday about this (and I talked to Filippo about it this morning): it would be fairly easy (and a Good Thing™) to keep" [tools/scap] - 10https://gerrit.wikimedia.org/r/238839 (https://phabricator.wikimedia.org/T109514) (owner: 10Dduvall) [15:28:39] 6operations: Various misc. sites (e.g. gerrit) load very slow or not at all - https://phabricator.wikimedia.org/T112902#1649430 (10Southparkfan) 3NEW [15:28:53] akosiaris ^ [15:29:05] SPF|Cloud: thanks! [15:32:53] PROBLEM - Host search.svc.codfw.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [15:34:21] ACKNOWLEDGEMENT - Host search.svc.codfw.wmnet is DOWN: PING CRITICAL - Packet loss = 100% cpettet new [15:36:44] RECOVERY - puppet last run on mw1259 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:37:43] RECOVERY - check_puppetrun on bellatrix is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [15:39:04] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 9.09% of data above the critical threshold [500.0] [15:40:09] (03CR) 1020after4: "+1 for /srv/deployment-cache" [tools/scap] - 10https://gerrit.wikimedia.org/r/238839 (https://phabricator.wikimedia.org/T109514) (owner: 10Dduvall) [15:41:59] (03CR) 10Aklapper: "Superseded by https://gerrit.wikimedia.org/r/239073" [puppet] - 10https://gerrit.wikimedia.org/r/239067 (owner: 10Aklapper) [15:45:19] (03CR) 10Alex Monk: [C: 04-1] "Don't you also want wgGroupsRemoveFromSelf?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/235900 (https://phabricator.wikimedia.org/T111455) (owner: 10MarcoAurelio) [15:46:24] yes Krenair I do. [15:46:27] fixing [15:49:54] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:51:38] 6operations, 10Wikimedia-Extension-setup, 7Database, 5Patch-For-Review: Enable wikilove on outreachwiki - https://phabricator.wikimedia.org/T106264#1649558 (10Krenair) Anyone who can deploy the config change can make the DB change. That's not the problem here. [15:51:54] 6operations, 6Commons: delete gwtoolset job by Hansmuller - https://phabricator.wikimedia.org/T112878#1649562 (10Dzahn) I tried to follow the instructions linked above. I can get a redis-cli shell and found the password. I saw `/home/tgr/redis-gwtoolset-list.txt` and ran `redis-cli -a -h rdb1001... [15:52:55] (03PS4) 10Alexandros Kosiaris: Also check submodules with rubocop [puppet] - 10https://gerrit.wikimedia.org/r/238471 (https://phabricator.wikimedia.org/T102020) (owner: 10JanZerebecki) [15:53:01] (03CR) 10Alexandros Kosiaris: [V: 032] Also check submodules with rubocop [puppet] - 10https://gerrit.wikimedia.org/r/238471 (https://phabricator.wikimedia.org/T102020) (owner: 10JanZerebecki) [15:54:51] (03PS3) 10MarcoAurelio: Flood flag configuration changes for es.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235900 (https://phabricator.wikimedia.org/T111455) [15:57:11] (03CR) 10Hashar: [C: 031] Support batch size configuration per stage [tools/scap] - 10https://gerrit.wikimedia.org/r/239016 (https://phabricator.wikimedia.org/T112841) (owner: 10Dduvall) [15:57:59] (03PS3) 10Dzahn: squid: logrotate for webproxy on carbon [puppet] - 10https://gerrit.wikimedia.org/r/239009 (https://phabricator.wikimedia.org/T97119) [15:58:28] (03PS6) 10Giuseppe Lavagetto: contint: Install chromedriver for running MW-Selenium tests [puppet] - 10https://gerrit.wikimedia.org/r/223691 
(https://phabricator.wikimedia.org/T103039) (owner: 10Dduvall) [15:58:32] <_joe_> !next [15:58:49] <_joe_> err [15:59:02] <_joe_> yuvipanda|maybeNOT ;) [15:59:09] jouncebot: next [15:59:09] In 0 hour(s) and 0 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150917T1600) [15:59:20] <_joe_> hashar: you here? [15:59:32] _joe_: yeah wanna start puppet swat ? [15:59:41] <_joe_> yes [16:00:05] YuviPanda _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150917T1600). [16:00:07] <_joe_> should we merge them all and test them alltoghether? [16:00:15] yeah [16:00:28] they are all deployed on the integration puppetmaster [16:00:28] (03CR) 10Giuseppe Lavagetto: [C: 032] contint: Install chromedriver for running MW-Selenium tests [puppet] - 10https://gerrit.wikimedia.org/r/223691 (https://phabricator.wikimedia.org/T103039) (owner: 10Dduvall) [16:00:33] so the only impact I thought of is gallium.wikimedia.org [16:00:41] (03PS4) 10Giuseppe Lavagetto: contint: upgrade setuptools from pypi [puppet] - 10https://gerrit.wikimedia.org/r/234254 (https://phabricator.wikimedia.org/T110506) (owner: 10Hashar) [16:00:58] I have chained them to avoid potential conflicts [16:01:05] <_joe_> k [16:01:20] so you can grab the last one, rebase locally, push to gerrit and merge all [16:01:29] (03CR) 10Giuseppe Lavagetto: [C: 032] contint: upgrade setuptools from pypi [puppet] - 10https://gerrit.wikimedia.org/r/234254 (https://phabricator.wikimedia.org/T110506) (owner: 10Hashar) [16:01:32] or I could have sent a merge commit :-} [16:01:45] SPF|Cloud: on the plus side, I think I 've figured out the problem. 
Making sure, I 'll update the task if it what I think indeed [16:01:51] (03CR) 10Giuseppe Lavagetto: [V: 032] contint: upgrade setuptools from pypi [puppet] - 10https://gerrit.wikimedia.org/r/234254 (https://phabricator.wikimedia.org/T110506) (owner: 10Hashar) [16:02:03] cool! [16:02:17] (03PS3) 10Giuseppe Lavagetto: contint: remove obsolete ruby related packages [puppet] - 10https://gerrit.wikimedia.org/r/238436 (owner: 10Hashar) [16:03:47] (03CR) 10Giuseppe Lavagetto: [C: 032] contint: remove obsolete ruby related packages [puppet] - 10https://gerrit.wikimedia.org/r/238436 (owner: 10Hashar) [16:03:58] (03PS3) 10Giuseppe Lavagetto: contint: remove postgresql [puppet] - 10https://gerrit.wikimedia.org/r/238438 (owner: 10Hashar) [16:05:00] (03CR) 10Giuseppe Lavagetto: [C: 032] contint: remove postgresql [puppet] - 10https://gerrit.wikimedia.org/r/238438 (owner: 10Hashar) [16:05:10] (03CR) 10Giuseppe Lavagetto: [V: 032] contint: remove postgresql [puppet] - 10https://gerrit.wikimedia.org/r/238438 (owner: 10Hashar) [16:05:24] (03PS4) 10Dzahn: Create ee.wikimedia.org for renaming from et.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/234426 (https://phabricator.wikimedia.org/T31919) (owner: 10Alex Monk) [16:05:46] (03PS3) 10Giuseppe Lavagetto: contint: remove subversion::client [puppet] - 10https://gerrit.wikimedia.org/r/238442 (owner: 10Hashar) [16:06:12] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] contint: remove subversion::client [puppet] - 10https://gerrit.wikimedia.org/r/238442 (owner: 10Hashar) [16:06:38] (03CR) 10Dzahn: [C: 032] "chapter confirmed: "we would still like the page of Wikimedia Eesti to be changed from et.wikimedia.org to ee.wikimedia.org."" [dns] - 10https://gerrit.wikimedia.org/r/234426 (https://phabricator.wikimedia.org/T31919) (owner: 10Alex Monk) [16:06:54] (03PS6) 10Giuseppe Lavagetto: contint: move some useful packags to a new base class [puppet] - 10https://gerrit.wikimedia.org/r/239044 (https://phabricator.wikimedia.org/T112821) 
(owner: Zfilipin) [16:07:07] (CR) Giuseppe Lavagetto: [C: 2 V: 2] contint: move some useful packags to a new base class [puppet] - https://gerrit.wikimedia.org/r/239044 (https://phabricator.wikimedia.org/T112821) (owner: Zfilipin) [16:07:41] <_joe_> hashar: merged [16:08:03] running puppet on gallium [16:08:13] _joe_: you were right earlier today, I should rely on puppet swat [16:08:21] that is less IRQ for you guys [16:08:25] (CR) Filippo Giunchedi: "@gwicke, I prefer explicit vs implicit and likely the two variables will go together, not sure it is worth binding the two now. Also for b" [puppet] - https://gerrit.wikimedia.org/r/238431 (https://phabricator.wikimedia.org/T112644) (owner: Filippo Giunchedi) [16:08:40] operations, Wikimedia-Site-Requests, Patch-For-Review: Move the wiki of WMEE - https://phabricator.wikimedia.org/T31919#1649617 (Dzahn) added to DNS: ``` ee.wikimedia.org has address 208.80.154.224 ee.wikimedia.org has IPv6 address 2620:0:861:ed1a::1 ``` [16:08:41] <_joe_> and I had to take time to look at those before with calm :) [16:09:54] _joe_: gallium is happy [16:10:00] the slaves as well apparently :-} [16:10:42] <_joe_> ok, good. [16:11:28] <_joe_> I'm going off then [16:11:39] _joe_: yeah enjoy your evening and thanks! [16:12:34] Is there space for https://gerrit.wikimedia.org/r/#/c/234427/1 (add an extra domain to an apache ServerAlias) ? [16:12:58] yuvipanda: ^^^^ for puppet swat :} [16:13:29] (CR) Reedy: "So the task is actually wrong, and should be updated?"
[mediawiki-config] - https://gerrit.wikimedia.org/r/239097 (https://phabricator.wikimedia.org/T99156) (owner: Reedy) [16:13:32] (CR) Dzahn: [C: 1] Add ee.wikimedia.org to apache config for chapters [puppet] - https://gerrit.wikimedia.org/r/234427 (https://phabricator.wikimedia.org/T31919) (owner: Alex Monk) [16:15:10] (CR) Alex Monk: [C: 2] Flood flag configuration changes for es.wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/235900 (https://phabricator.wikimedia.org/T111455) (owner: MarcoAurelio) [16:15:34] (Merged) jenkins-bot: Flood flag configuration changes for es.wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/235900 (https://phabricator.wikimedia.org/T111455) (owner: MarcoAurelio) [16:16:08] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/235900/ (duration: 00m 12s) [16:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:16:17] ssh: Could not resolve hostname tmh1001.eqiad.wmnet: Name or service not known [16:16:25] didn't that just get renamed? [16:16:25] Krenair: it was renamed today [16:17:08] dsh group file..
[16:17:32] (PS2) Alex Monk: Add ee.wikimedia.org to apache config for chapters [puppet] - https://gerrit.wikimedia.org/r/234427 (https://phabricator.wikimedia.org/T31919) [16:17:41] fixing [16:17:49] thanks [16:17:53] (PS2) Faidon Liambotis: Remove home_pmtpa and svn client from bast1001 [puppet] - https://gerrit.wikimedia.org/r/231142 [16:17:55] (PS1) Faidon Liambotis: Remove the subversion module [puppet] - https://gerrit.wikimedia.org/r/239126 [16:17:57] _joe_: ^ [16:18:33] (CR) Giuseppe Lavagetto: [C: 1] Remove the subversion module [puppet] - https://gerrit.wikimedia.org/r/239126 (owner: Faidon Liambotis) [16:22:52] (PS1) Dzahn: fix dsh group files for tmh1001->mw1259 rename [puppet] - https://gerrit.wikimedia.org/r/239127 [16:22:55] _joe_: ^ [16:23:02] thcipriani: ^ that should fix it [16:24:29] mutante: nice. Thanks! [16:24:41] (CR) Alex Monk: "The task is backwards. We should be moving towards using test.wikimedia." [mediawiki-config] - https://gerrit.wikimedia.org/r/239097 (https://phabricator.wikimedia.org/T99156) (owner: Reedy) [16:25:24] (Abandoned) Reedy: Beta cluster "test.wikipedia" thinks it is "test.wikimedia" [mediawiki-config] - https://gerrit.wikimedia.org/r/239097 (https://phabricator.wikimedia.org/T99156) (owner: Reedy) [16:26:39] (PS2) Dzahn: fix dsh group files for tmh1001->mw1259 rename [puppet] - https://gerrit.wikimedia.org/r/239127 [16:27:46] (PS3) Dzahn: fix dsh group files for tmh1001->mw1259 rename [puppet] - https://gerrit.wikimedia.org/r/239127 [16:27:57] (CR) Dzahn: [C: 2] fix dsh group files for tmh1001->mw1259 rename [puppet] - https://gerrit.wikimedia.org/r/239127 (owner: Dzahn) [16:28:03] <_joe_> mutante: ach!
[16:28:05] <_joe_> sorry [16:28:30] no problem, it was ready, right [16:28:54] PROBLEM - check_puppetrun on rigel is CRITICAL: CRITICAL: puppet fail [16:29:47] operations: education-collab-owner@lists.wikimedia.org bounce notification - https://phabricator.wikimedia.org/T112912#1649736 (Krenair) Adding operations [16:30:34] (CR) Filippo Giunchedi: create application users (WIP) (3 comments) [puppet] - https://gerrit.wikimedia.org/r/239000 (https://phabricator.wikimedia.org/T92590) (owner: Eevans) [16:30:44] operations: education-collab-owner@lists.wikimedia.org bounce notification - https://phabricator.wikimedia.org/T112912#1649748 (Krenair) I think @MarkTraceur was seeing something similar? [16:31:14] operations, Wikimedia-Mailing-lists: education-collab-owner@lists.wikimedia.org bounce notification - https://phabricator.wikimedia.org/T112912#1649750 (Krenair) [16:31:58] Did someone just put a phone number into a bug report [16:32:15] They also didn't set any projects [16:32:50] operations, Wikimedia-Mailing-lists: education-collab-owner@lists.wikimedia.org bounce notification - https://phabricator.wikimedia.org/T112912#1649758 (MarkTraceur) I've been seeing these reports on multimedia-team and multimedia-alerts. None of the postmaster addresses work, and none of the phone number... [16:33:26] mutante, I ran [16:33:29] sync-common [16:33:33] akosiaris, is it possible to point 3 different puppet roles to the same git repo?
https://phabricator.wikimedia.org/T112914 [16:33:36] on mw1259 [16:33:54] PROBLEM - check_puppetrun on rigel is CRITICAL: CRITICAL: puppet fail [16:38:54] RECOVERY - check_puppetrun on rigel is OK: OK: Puppet is currently enabled, last run 273 seconds ago with 0 failures [16:43:14] PROBLEM - Disk space on analytics1015 is CRITICAL: DISK CRITICAL - free space: / 1493 MB (3% inode=96%) [16:43:27] !log restart elasticsearch on 1005 [16:43:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:44:39] operations, Traffic, fundraising-tech-ops, IPv6, Patch-For-Review: Enable IPv6 on donate.wikimedia.org - https://phabricator.wikimedia.org/T73267#1649822 (CCogdill_WMF) @BBlack no emails will be going out for the rest of the week, so today or tomorrow would be great. I'm assuming gerrit or somet... [16:48:33] (CR) Rush: "I don't think we could survive with this many nodes at current load really, would 24 be more realistic?" [puppet] - https://gerrit.wikimedia.org/r/238850 (owner: EBernhardson) [16:52:21] operations, Discovery, Elasticsearch, Epic: EPIC: Cultivating the Elasticsearch garden (operational lessons from 1.7.1 upgrade) - https://phabricator.wikimedia.org/T109089#1649870 (Deskana) [16:56:43] PROBLEM - check_puppetrun on alnilam is CRITICAL: CRITICAL: Puppet has 49 failures [17:00:20] (PS1) Mobrovac: RESTBase: Update config [puppet] - https://gerrit.wikimedia.org/r/239133 [17:01:43] RECOVERY - check_puppetrun on alnilam is OK: OK: Puppet is currently enabled, last run 192 seconds ago with 0 failures [17:02:17] (CR) GWicke: "Looking through the uses of the 'cluster' variable in puppet, it looks like it might actually do the right thing if we set it to 'restbase" [puppet] - https://gerrit.wikimedia.org/r/238431 (https://phabricator.wikimedia.org/T112644) (owner: Filippo Giunchedi) [17:03:54] PROBLEM - check_puppetrun on rigel is CRITICAL: CRITICAL: Puppet has 47 failures [17:04:54] RECOVERY -
Disk space on analytics1015 is OK: DISK OK [17:06:53] RECOVERY - mediawiki-installation DSH group on mw1259 is OK: OK [17:08:48] (Abandoned) Hoo man: Force HTTPS for graphite [puppet] - https://gerrit.wikimedia.org/r/181949 (owner: Hoo man) [17:08:54] RECOVERY - check_puppetrun on rigel is OK: OK: Puppet is currently enabled, last run 212 seconds ago with 0 failures [17:09:39] (CR) GWicke: "- The cassandra module would be able to reference the more general 'cluster' in its logstash reporter." [puppet] - https://gerrit.wikimedia.org/r/238431 (https://phabricator.wikimedia.org/T112644) (owner: Filippo Giunchedi) [17:10:00] (PS1) BBlack: define search IPs for low-traffic codfw LVS [puppet] - https://gerrit.wikimedia.org/r/239136 [17:11:44] (CR) BBlack: [C: 2] define search IPs for low-traffic codfw LVS [puppet] - https://gerrit.wikimedia.org/r/239136 (owner: BBlack) [17:12:34] akosiaris: you said you would post an update (T112902), forgot to post it, or will you do soon? [17:12:43] PROBLEM - check_puppetrun on bellatrix is CRITICAL: CRITICAL: Puppet has 3 failures [17:17:43] RECOVERY - check_puppetrun on bellatrix is OK: OK: Puppet is currently enabled, last run 159 seconds ago with 0 failures [17:20:14] Puppet, Analytics-Backlog, Analytics-Wikimetrics: Cleanup Wikimetrics puppet module so it can run puppet continuously without own puppetmaster {dove} - https://phabricator.wikimedia.org/T101763#1649942 (madhuvishy) p:Normal>Low [17:21:33] operations, netops: Various misc. sites (e.g. gerrit) load very slow or not at all - https://phabricator.wikimedia.org/T112902#1649949 (faidon) [17:22:34] operations, netops: Various misc. sites (e.g. gerrit) load very slow or not at all - https://phabricator.wikimedia.org/T112902#1649430 (faidon) I didn't find evidence of congestion but that's likely because I didn't look during peak hours. I've changed the return path to your ISP to go via a different tr...
[17:26:35] (CR) Ori.livneh: "The commit message makes sense, but opening up access ought to have better reasons for it than "it's what we did elsewhere", IMO." [puppet] - https://gerrit.wikimedia.org/r/239023 (owner: Dzahn) [17:29:34] (CR) Dzahn: "@ori i think there is value in itself that we can expect the same from hosts with the same name and roles. but there is also more than tha" [puppet] - https://gerrit.wikimedia.org/r/239023 (owner: Dzahn) [17:32:49] _joe_: next [17:32:52] err [17:32:56] jouncebot: next [17:32:56] In 0 hour(s) and 27 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150917T1800) [17:32:59] _joe_: sorry :P [17:34:59] operations, netops: Various misc. sites (e.g. gerrit) load very slow or not at all - https://phabricator.wikimedia.org/T112902#1650016 (Southparkfan) Open>Resolved a:Southparkfan Yes, that fixed it. Thanks Faidon. [17:35:34] !log legoktm@tin Synchronized php-1.26wmf23/includes/registration/ExtensionRegistry.php: registration: Fix merging of array_plus (duration: 00m 11s) [17:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:37:43] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet has 70 failures [17:38:03] !log legoktm@tin Synchronized php-1.26wmf22/includes/registration/ExtensionRegistry.php: registration: Fix merging of array_plus (duration: 00m 13s) [17:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:39:52] operations, Wikimedia-Mailing-lists: education-collab-owner@lists.wikimedia.org bounce notification - https://phabricator.wikimedia.org/T112912#1650032 (Dzahn) "If **bounce_unrecognized_goes_to_list_owner is Yes**, any message received at the listname-bounces address which is not recognized by Mailman as...
[17:42:43] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet has 70 failures [17:44:18] operations, Wikimedia-Mailing-lists: education-collab-owner@lists.wikimedia.org bounce notification - https://phabricator.wikimedia.org/T112912#1650056 (Dzahn) see these settings: [[ https://lists.wikimedia.org/mailman/admin/education-collab/?VARHELP=bounce/bounce_processing | Should Mailman perform auto... [17:45:33] RECOVERY - Host heka is UP: PING OK - Packet loss = 0%, RTA = 34.47 ms [17:46:05] operations, Wikimedia-Mailing-lists: education-collab-owner@lists.wikimedia.org bounce notification - https://phabricator.wikimedia.org/T112912#1650061 (Dzahn) quote from mailman UI: //While Mailman's bounce detector is fairly robust, it's impossible to detect every bounce format in the world. You should... [17:47:43] RECOVERY - check_puppetrun on mintaka is OK: OK: Puppet is currently enabled, last run 92 seconds ago with 0 failures [17:53:50] (CR) EBernhardson: "i agree 16 is still to optimistic, i don't know that 24 would be ok but its probably better than 16." [puppet] - https://gerrit.wikimedia.org/r/238850 (owner: EBernhardson) [17:55:22] _joe_: need anything more from me on the video scalers? [17:55:38] (CR) Catrope: [C: 1] Set $wgFlowMigrateReferenceWiki false on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/234207 (https://phabricator.wikimedia.org/T107204) (owner: Mattflaschen) [17:56:03] (CR) Eevans: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/239133 (owner: Mobrovac) [18:00:04] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150917T1800). Please do the needful. [18:00:33] I love you jouncebot.
[18:01:44] :) [18:05:36] operations, Wikimedia-Mailing-lists: education-collab-owner@lists.wikimedia.org bounce notification - https://phabricator.wikimedia.org/T112912#1650147 (eliza) Thank you Dzahn, So I should convey the following information to Samir? Should Mailman perform automatic bounce processing? < https://lists.wiki... [18:08:51] Ok jouncebot, doing the needful. [18:11:56] operations, Wikimedia-Mailing-lists: education-collab-owner@lists.wikimedia.org bounce notification - https://phabricator.wikimedia.org/T112912#1650182 (Tbayer) The same thing is happening on the social-media, research-newsletter and wikimediaannounce-l lists. [18:12:33] (PS1) Brion VIBBER: Update ssh key for brion [puppet] - https://gerrit.wikimedia.org/r/239155 [18:13:30] (PS1) 20after4: wikipedia wikis to 1.26wmf23 [mediawiki-config] - https://gerrit.wikimedia.org/r/239156 [18:13:45] (CR) 20after4: [C: 2] wikipedia wikis to 1.26wmf23 [mediawiki-config] - https://gerrit.wikimedia.org/r/239156 (owner: 20after4) [18:13:51] (Merged) jenkins-bot: wikipedia wikis to 1.26wmf23 [mediawiki-config] - https://gerrit.wikimedia.org/r/239156 (owner: 20after4) [18:13:53] <_joe_> brion: actually, no. Just a heads-up early next week when I reimage the last host [18:14:09] _joe_: awesome, thanks :D [18:14:15] i'll tinker at them and report if i see any problems [18:14:17] !log twentyafterfour@tin rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedia wikis to 1.26wmf23 [18:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:14:26] seeing some timeouts but that may just be because paladox is hitting the resets hard ;) [18:16:44] twentyafterfour: some interface messages seem to be missing again...
it was fixed for testwiki two days ago somehow [18:19:06] operations, Commons: delete gwtoolset job by Hansmuller - https://phabricator.wikimedia.org/T112878#1650215 (Bawolff) https://commons.wikimedia.org/w/index.php?title=Special%3ALog&type=gwtoolset&user=Hansmuller&page=&year=&month=-1&tagfilter= still lists new things happening, so I guess not. [18:19:18] operations, Commons: delete gwtoolset job by Hansmuller - https://phabricator.wikimedia.org/T112878#1650216 (Bawolff) [18:20:08] (PS1) Jdlrobson: Replicate browser test config [mediawiki-config] - https://gerrit.wikimedia.org/r/239158 (https://phabricator.wikimedia.org/T112204) [18:20:15] (CR) jenkins-bot: [V: -1] Replicate browser test config [mediawiki-config] - https://gerrit.wikimedia.org/r/239158 (https://phabricator.wikimedia.org/T112204) (owner: Jdlrobson) [18:20:44] operations, Traffic, fundraising-tech-ops, IPv6, Patch-For-Review: Enable IPv6 on donate.wikimedia.org - https://phabricator.wikimedia.org/T73267#1650219 (BBlack) Thanks! I'm going to do it now, and yeah gerrit will log the merge here regardless. [18:20:56] (PS2) Jdlrobson: Replicate browser test config for QuickSurveys [mediawiki-config] - https://gerrit.wikimedia.org/r/239158 (https://phabricator.wikimedia.org/T112204) [18:20:59] (PS2) BBlack: Enable IPv6 for donate.wm.o [dns] - https://gerrit.wikimedia.org/r/235749 (https://phabricator.wikimedia.org/T73267) [18:21:27] (CR) BBlack: [C: 2] Enable IPv6 for donate.wm.o [dns] - https://gerrit.wikimedia.org/r/235749 (https://phabricator.wikimedia.org/T73267) (owner: BBlack) [18:22:50] operations, Traffic, fundraising-tech-ops, IPv6, Patch-For-Review: Enable IPv6 on donate.wikimedia.org - https://phabricator.wikimedia.org/T73267#1650229 (BBlack) (btw, the relevant TTLs are 600s, so you may not see the full effect for at least 10 minutes. Maybe longer for some clients, if they...
[18:24:09] Nikerabbit: hmm, odd [18:24:48] should I rebuild the localization cache? I don't know what would cause missing interface messages [18:26:03] twentyafterfour: I don't know either [18:27:41] two days ago is when wmf23 was pushed to testwiki [18:28:04] Nikerabbit: can you give me an example of what you're seeing? [18:28:44] twentyafterfour: just a minute [18:31:07] twentyafterfour: https://www.dropbox.com/s/d0krcsluzyzsdo1/cx-wmf.png?dl=0 [18:31:21] [18:33:36] :O [18:33:37] why isn't it falling back to English? [18:34:46] Platonides: because messages are missing somewhere? [18:36:19] Platonides: probably resource loader message cache [18:36:23] mwscript eval.php --wiki ptwiki [18:36:23] > echo wfMessage( 'cx-translation-filter-published-translations' )->plain(); [18:36:26] Publicadas [18:40:54] csteipp: does login via token work on https://wikitech.wikimedia.org/w/index.php?title=Special:UserLogin&returnto=Main+Page&returntoquery=mobileaction%3Dtoggle_view_mobile%26welcome%3Dyes now? [18:40:58] i don't use one so unable to check [18:41:41] jdlrobson: Hmm? You mean "keep me logged in"? [18:41:49] Or you mean the OATH token? [18:42:51] 2FA works for me [18:43:53] yay [18:43:57] sweet [18:44:00] yeh the OAUTH token [18:50:52] operations, Commons: delete gwtoolset job by Hansmuller - https://phabricator.wikimedia.org/T112878#1650388 (Krenair) What about now? [18:52:42] Ops-Access-Requests, operations, Beta-Cluster, Labs: Add AWight and EWulczyn to the deployment-prep Nova project - https://phabricator.wikimedia.org/T112927#1650389 (awight) NEW [18:52:44] operations, ops-codfw: Humidity Alarms from codfw - https://phabricator.wikimedia.org/T110421#1650397 (Papaul) I had a talk with Thomas on this case. Thomas is the person responsible of the air flow regulation in DH7. He suggestion is to put blinks n all the cabinets that are not full with server (A2,...
[18:54:01] (CR) GWicke: RESTBase: Update config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/239133 (owner: Mobrovac) [18:54:05] operations, Mail: Remove Alias for sj@wm.o - https://phabricator.wikimedia.org/T108276#1650402 (Dzahn) @Sj I just talked with @JGulingan at office and she created sklein@ on the Google side and added the aliases you had (sj@ sam@ samuel@) and will activate the autoresponder. So i just deleted the aliases... [18:54:25] operations, Mail: Remove Alias for sj@wm.o - https://phabricator.wikimedia.org/T108276#1650403 (Dzahn) a:JGulingan>Dzahn [18:54:49] operations, Mail: Remove Alias for sj@wm.o - https://phabricator.wikimedia.org/T108276#1650406 (Dzahn) stalled>Resolved all done on our side, feel free to reopen if any questions. [18:57:10] operations, Commons: delete gwtoolset job by Hansmuller - https://phabricator.wikimedia.org/T112878#1650420 (Tgr) The job delete commands are in `redis-gwtoolset-clear.txt`, not `redis-gwtoolset-list.txt`. I ran them a couple minutes ago, but it looks like the uploads have stopped an hour ago anyway. [18:57:18] Ops-Access-Requests, operations, Beta-Cluster, Labs: Add AWight and EWulczyn to the deployment-prep Nova project - https://phabricator.wikimedia.org/T112927#1650422 (Krenair) Why is this in #Ops-Access-Requests and #operations? [18:57:24] twentyafterfour, try grepping without the initial [ ¿ [18:59:20] operations, Commons: delete gwtoolset job by Hansmuller - https://phabricator.wikimedia.org/T112878#1650427 (Dzahn) Yes, i wanted to just list the jobs and confirm it's the right thing. Thanks, @Tgr for running it. [18:59:25] Ops-Access-Requests, operations, Beta-Cluster, Labs: Add AWight and EWulczyn to the deployment-prep Nova project - https://phabricator.wikimedia.org/T112927#1650429 (greg) 1) Beta Cluster, not beta labs (please rename your wikipage), see https://wikitech.wikimedia.org/wiki/Labs_labs_labs 2) no nee...
[19:02:11] operations, Mail: Remove Alias for sj@wm.o - https://phabricator.wikimedia.org/T108276#1650442 (JGulingan) @Sj I've set your autoresponder to saying "Thank you for your message. This email address is no longer active, so you can reach Samuel at meta.sj@gmail.com." Just to let you know, I will delete this a... [19:08:24] PROBLEM - DPKG on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:08:34] PROBLEM - dhclient process on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:08:43] PROBLEM - Disk space on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:09:03] PROBLEM - RAID on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:09:04] PROBLEM - nutcracker process on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:09:13] PROBLEM - SSH on mw1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:09:15] PROBLEM - puppet last run on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:09:53] PROBLEM - configured eth on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:11:36] SPF|Cloud: no, I did not forget. I was discussing it with paravoid. IIRC he posted an update, we changed the return path to your ISP [19:12:03] !log powercycling unresponse mw1005 [19:12:05] yeah, I saw it [19:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:12:14] PROBLEM - nutcracker port on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:12:28] Thanks for that.
slow sites are so annoying [19:12:44] PROBLEM - salt-minion processes on mw1005 is CRITICAL: Timeout while attempting connection [19:14:24] RECOVERY - salt-minion processes on mw1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:14:24] RECOVERY - puppet last run on mw1005 is OK: OK: Puppet is currently enabled, last run 11 minutes ago with 0 failures [19:14:54] RECOVERY - configured eth on mw1005 is OK: OK - interfaces up [19:15:14] RECOVERY - DPKG on mw1005 is OK: All packages OK [19:15:24] RECOVERY - dhclient process on mw1005 is OK: PROCS OK: 0 processes with command name dhclient [19:15:35] RECOVERY - nutcracker port on mw1005 is OK: TCP OK - 0.000 second response time on port 11212 [19:15:36] RECOVERY - Disk space on mw1005 is OK: DISK OK [19:15:53] RECOVERY - RAID on mw1005 is OK: OK: no RAID installed [19:15:54] RECOVERY - nutcracker process on mw1005 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [19:16:03] RECOVERY - SSH on mw1005 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [19:19:31] operations, RESTBase, RESTBase-Cassandra, Patch-For-Review: use non-default credentials when authenticating to Cassandra - https://phabricator.wikimedia.org/T92590#1650572 (Eevans) [19:20:15] (PS1) Ottomata: Rename analytics nodes to aqs (analytics query service), put them in private1 vlans [dns] - https://gerrit.wikimedia.org/r/239175 (https://phabricator.wikimedia.org/T111053) [19:20:49] operations, Discovery-Maps-Sprint: water_polygons import is broken - https://phabricator.wikimedia.org/T112831#1650581 (Dzahn) p:Triage>Normal [19:21:08] operations: operations/software/conftool fails tox-py27-jessie - https://phabricator.wikimedia.org/T112853#1650583 (Dzahn) p:Triage>Normal [19:21:23] operations, Commons: delete gwtoolset job by Hansmuller - https://phabricator.wikimedia.org/T112878#1650585 (Dzahn) p:Triage>Normal [19:23:36] (PS3) Dzahn: Add
ee.wikimedia.org to apache config for chapters [puppet] - https://gerrit.wikimedia.org/r/234427 (https://phabricator.wikimedia.org/T31919) (owner: Alex Monk) [19:23:47] (CR) Dzahn: [C: 2] Add ee.wikimedia.org to apache config for chapters [puppet] - https://gerrit.wikimedia.org/r/234427 (https://phabricator.wikimedia.org/T31919) (owner: Alex Monk) [19:26:40] (PS1) Ottomata: Rename analytics1011, 1016, and 1019 to aqs1001, 1002, 1003 [puppet] - https://gerrit.wikimedia.org/r/239177 (https://phabricator.wikimedia.org/T111053) [19:27:09] operations, hardware-requests, Patch-For-Review: Request three servers for Pageview API - https://phabricator.wikimedia.org/T111053#1650612 (Ottomata) Ok! https://gerrit.wikimedia.org/r/#/c/239175/ > This is disconcerting though. We got 3 so we should be able to bear the problems associated with 1 (... [19:28:53] Krenair: ^ adding the ServerAlias for ee.wm made it redirect to ee.wp now [19:29:01] yep, thanks [19:29:06] oh, ee.wp? ugh [19:29:15] right, well that's not hugely surprising [19:29:15] yea, eh [19:29:29] but it should be like any other chapter [19:30:17] yeah, I know [19:30:23] (CR) Dzahn: "before:" [puppet] - https://gerrit.wikimedia.org/r/234427 (https://phabricator.wikimedia.org/T31919) (owner: Alex Monk) [19:30:24] the mediawiki side is not done yet, just the initial ops steps [19:30:44] ack [19:31:19] give it a little while, just letting puppet deploy [19:31:25] i tested on mw1033 [19:34:31] (PS1) Alex Monk: Make ee.wikimedia.org work [mediawiki-config] - https://gerrit.wikimedia.org/r/239181 (https://phabricator.wikimedia.org/T31919) [19:34:37] (PS4) Andrew Bogott: Added tests for grid job submission.
[puppet] - https://gerrit.wikimedia.org/r/238863 (https://phabricator.wikimedia.org/T97748) [19:34:39] (PS1) Andrew Bogott: Add a read/write/delete check for tools-db [puppet] - https://gerrit.wikimedia.org/r/239182 (https://phabricator.wikimedia.org/T97748) [19:34:41] (PS1) Andrew Bogott: toolschecker: test for labsdb1004 [puppet] - https://gerrit.wikimedia.org/r/239183 (https://phabricator.wikimedia.org/T107449) [19:35:23] twentyafterfour, still deploying? [19:35:37] (CR) jenkins-bot: [V: -1] Add a read/write/delete check for tools-db [puppet] - https://gerrit.wikimedia.org/r/239182 (https://phabricator.wikimedia.org/T97748) (owner: Andrew Bogott) [19:35:43] (CR) jenkins-bot: [V: -1] toolschecker: test for labsdb1004 [puppet] - https://gerrit.wikimedia.org/r/239183 (https://phabricator.wikimedia.org/T107449) (owner: Andrew Bogott) [19:37:08] operations, Discovery-Maps-Sprint: water_polygons import is broken - https://phabricator.wikimedia.org/T112831#1650683 (Dzahn) @MaxSem please link me to the gerrit change we just talked about in the kitchen [19:38:27] (PS2) Andrew Bogott: toolschecker: test for labsdb1004 [puppet] - https://gerrit.wikimedia.org/r/239183 (https://phabricator.wikimedia.org/T107449) [19:38:29] (PS2) Andrew Bogott: Add a read/write/delete check for tools-db [puppet] - https://gerrit.wikimedia.org/r/239182 (https://phabricator.wikimedia.org/T97748) [19:39:40] taking that as a no...
[19:40:16] (CR) Alex Monk: [C: 2] Make ee.wikimedia.org work [mediawiki-config] - https://gerrit.wikimedia.org/r/239181 (https://phabricator.wikimedia.org/T31919) (owner: Alex Monk) [19:40:22] (Merged) jenkins-bot: Make ee.wikimedia.org work [mediawiki-config] - https://gerrit.wikimedia.org/r/239181 (https://phabricator.wikimedia.org/T31919) (owner: Alex Monk) [19:41:03] !log krenair@tin Synchronized multiversion/MWMultiVersion.php: https://gerrit.wikimedia.org/r/#/c/239181/ (duration: 00m 14s) [19:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:42:46] (CR) coren: [C: -1] "Looks good to me functionally, but I'd really rather you invoke '/bin/true' or even '/bin/sleep' rather than add an extra python script th" [puppet] - https://gerrit.wikimedia.org/r/238863 (https://phabricator.wikimedia.org/T97748) (owner: Andrew Bogott) [19:43:40] urgh... [19:43:54] not quite. I know what's up [19:44:00] (PS3) Dzahn: Move host contact_groups to hiera and migrate existing [puppet] - https://gerrit.wikimedia.org/r/235065 (owner: John F. Lewis) [19:44:45] (CR) coren: [C: 1] "Sane."
[puppet] - https://gerrit.wikimedia.org/r/239182 (https://phabricator.wikimedia.org/T97748) (owner: Andrew Bogott) [19:45:58] when adding secrets/password in puppet, please also add a fake secret in labs/private [19:46:10] !log krenair@tin Synchronized multiversion/MWMultiVersion.php: (no message) (duration: 00m 12s) [19:46:16] so that we can keep using the puppet compiler to test thigns [19:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:46:32] it's often broken due to this [19:47:11] operations, Discovery-Maps-Sprint: water_polygons import is broken - https://phabricator.wikimedia.org/T112831#1650754 (MaxSem) https://gerrit.wikimedia.org/r/#/c/238993/ [19:47:16] mutante, ^^^ [19:47:54] (PS2) Dzahn: Temporarily disable import_waterlines cronjob [puppet] - https://gerrit.wikimedia.org/r/238993 (owner: MaxSem) [19:47:55] ok! [19:48:24] (PS1) Alex Monk: Make ee.wikimedia.org work, take 2 [mediawiki-config] - https://gerrit.wikimedia.org/r/239187 (https://phabricator.wikimedia.org/T31919) [19:48:47] (CR) Alex Monk: [C: 2] Make ee.wikimedia.org work, take 2 [mediawiki-config] - https://gerrit.wikimedia.org/r/239187 (https://phabricator.wikimedia.org/T31919) (owner: Alex Monk) [19:48:52] (Merged) jenkins-bot: Make ee.wikimedia.org work, take 2 [mediawiki-config] - https://gerrit.wikimedia.org/r/239187 (https://phabricator.wikimedia.org/T31919) (owner: Alex Monk) [19:49:30] (CR) Dzahn: [C: 2] "https://en.wikipedia.org/wiki/Slartibartfast is still working on the coast line" [puppet] - https://gerrit.wikimedia.org/r/238993 (owner: MaxSem) [19:51:02] operations, Discovery-Maps-Sprint: water_polygons import is broken - https://phabricator.wikimedia.org/T112831#1650775 (Dzahn) merged. that cron job will be disabled on next puppet run.
[19:52:09] (CR) Ottomata: "I just tested this on analytics1004 in two ways:" [puppet] - https://gerrit.wikimedia.org/r/237688 (https://phabricator.wikimedia.org/T112265) (owner: Ottomata) [19:55:06] (PS2) GWicke: RESTBase: Update config [puppet] - https://gerrit.wikimedia.org/r/239133 (owner: Mobrovac) [19:55:22] (CR) Ottomata: Rsync api log archives from fluorine to stat1002 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/238798 (https://phabricator.wikimedia.org/T112744) (owner: Addshore) [19:56:05] (CR) GWicke: [C: 1] "Given the amount of delta between this & what we actually tested in staging, I think we should test this once more in staging before movin" [puppet] - https://gerrit.wikimedia.org/r/239133 (owner: Mobrovac) [19:58:23] (PS5) Andrew Bogott: Added tests for grid job submission. [puppet] - https://gerrit.wikimedia.org/r/238863 (https://phabricator.wikimedia.org/T97748) [19:58:25] (PS3) Andrew Bogott: toolschecker: test for labsdb1004 [puppet] - https://gerrit.wikimedia.org/r/239183 (https://phabricator.wikimedia.org/T107449) [19:58:27] (PS3) Andrew Bogott: Add a read/write/delete check for tools-db [puppet] - https://gerrit.wikimedia.org/r/239182 (https://phabricator.wikimedia.org/T97748) [19:59:26] operations, Discovery-Maps-Sprint: water_polygons import is broken - https://phabricator.wikimedia.org/T112831#1650812 (Dzahn) p:Normal>Low [20:00:41] (PS6) Andrew Bogott: Added tests for grid job submission. [puppet] - https://gerrit.wikimedia.org/r/238863 (https://phabricator.wikimedia.org/T97748) [20:01:53] (CR) Andrew Bogott: [C: 2] Added tests for grid job submission.
[puppet] - 10https://gerrit.wikimedia.org/r/238863 (https://phabricator.wikimedia.org/T97748) (owner: 10Andrew Bogott) [20:02:19] (03PS1) 10Alex Monk: Add ee.wikimedia.org to restbase config [puppet] - 10https://gerrit.wikimedia.org/r/239192 (https://phabricator.wikimedia.org/T31919) [20:02:50] (03PS2) 10Dzahn: Add ee.wikimedia.org to restbase config [puppet] - 10https://gerrit.wikimedia.org/r/239192 (https://phabricator.wikimedia.org/T31919) (owner: 10Alex Monk) [20:03:41] (03CR) 10Dzahn: [C: 032] Add ee.wikimedia.org to restbase config [puppet] - 10https://gerrit.wikimedia.org/r/239192 (https://phabricator.wikimedia.org/T31919) (owner: 10Alex Monk) [20:04:34] (03PS4) 10Andrew Bogott: Add a read/write/delete check for tools-db [puppet] - 10https://gerrit.wikimedia.org/r/239182 (https://phabricator.wikimedia.org/T97748) [20:05:23] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/897/" [puppet] - 10https://gerrit.wikimedia.org/r/235065 (owner: 10John F. Lewis) [20:05:25] 6operations, 10ops-codfw, 10netops: cr1-eqdfw PEM 0 failure - https://phabricator.wikimedia.org/T110435#1650843 (10Papaul) I received the part replacement for cr-eqdfw. I will go on site tomorrow at 10:00am to do the replacement. [20:05:31] (03PS4) 10EBernhardson: [elasticsearch] Update recover_after_nodes value [puppet] - 10https://gerrit.wikimedia.org/r/238850 [20:06:02] (03CR) 10Andrew Bogott: [C: 032] Add a read/write/delete check for tools-db [puppet] - 10https://gerrit.wikimedia.org/r/239182 (https://phabricator.wikimedia.org/T97748) (owner: 10Andrew Bogott) [20:11:52] (03PS4) 10Dzahn: Move host contact_groups to hiera and migrate existing [puppet] - 10https://gerrit.wikimedia.org/r/235065 (owner: 10John F. 
Lewis) [20:16:53] (03PS4) 10Andrew Bogott: toolschecker: test for labsdb1004 [puppet] - 10https://gerrit.wikimedia.org/r/239183 (https://phabricator.wikimedia.org/T107449) [20:16:56] (03PS1) 10Andrew Bogott: Toolschecker: Fix the test url for the toolsdb check [puppet] - 10https://gerrit.wikimedia.org/r/239196 (https://phabricator.wikimedia.org/T97748) [20:17:12] (03CR) 10Andrew Bogott: [C: 032] Toolschecker: Fix the test url for the toolsdb check [puppet] - 10https://gerrit.wikimedia.org/r/239196 (https://phabricator.wikimedia.org/T97748) (owner: 10Andrew Bogott) [20:18:08] (03CR) 10Dzahn: [C: 032] Move host contact_groups to hiera and migrate existing [puppet] - 10https://gerrit.wikimedia.org/r/235065 (owner: 10John F. Lewis) [20:18:26] godog, mutante: do either of you know the GPG key used to sign packages at releases.wikimedia.org ? [20:18:35] (03PS2) 10Andrew Bogott: Toolschecker: Fix the test url for the toolsdb check [puppet] - 10https://gerrit.wikimedia.org/r/239196 (https://phabricator.wikimedia.org/T97748) [20:18:47] I'm getting: "W: GPG error: https://releases.wikimedia.org jessie-mediawiki InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 7A322AC6E84AFDD2" but I can't find a key with that ID at keys.gnupg.net [20:19:14] (03PS1) 10Alex Monk: Update server variables for ee.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239206 (https://phabricator.wikimedia.org/T31919) [20:19:31] oooh, that's good news though that it's jessie! [20:19:57] gwicke: ^ https://releases.wikimedia.org/debian/dists/jessie-mediawiki/ just the key issue cscott reports [20:20:06] mutante: gwicke said I should ask you. 
;) [20:20:10] i like how it even exists [20:20:28] mutante: yeah, everything else works like a charm [20:20:34] (03CR) 10Madhuvishy: [C: 031] Set replace=True for EventLogging MySQL consumer [puppet] - 10https://gerrit.wikimedia.org/r/237688 (https://phabricator.wikimedia.org/T112265) (owner: 10Ottomata) [20:21:04] (04:14:54 PM) cscott-free: gwicke: releases uses keyid 7A322AC6E84AFDD2 but that doesn't seem to be present in keys.gnupg.net [20:21:04] (04:15:10 PM) cscott-free: is there another keyserver I should check? [20:21:04] (04:15:25 PM) gwicke: not sure.. normally they all exchange keys [20:21:04] (04:15:34 PM) gwicke: maybe it's not published? [20:21:07] (04:16:11 PM) cscott-free: or else i'm using the wrong string for the keyid. but apt-get update says: [20:21:07] (04:16:11 PM) cscott-free: W: GPG error: https://releases.wikimedia.org jessie-mediawiki InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 7A322AC6E84AFDD2 [20:21:09] (04:16:32 PM) cscott-free: so i'm hoping that 7A322AC6E84AFDD2 maps in some straightforward way to the key id. maybe not, though? [20:21:09] (04:16:49 PM) gwicke: maybe ask mutante in the ops channel? godog would know otherwise [20:21:12] (04:17:25 PM) gwicke: I'm sure they would be happy to upload the public key [20:21:13] cscott: aye, the public key is in puppet, I just sent it to the keyservers [20:21:14] ^ backlog from #mediawiki-parsoid [20:21:16] (03CR) 10Alex Monk: [C: 032] Update server variables for ee.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239206 (https://phabricator.wikimedia.org/T31919) (owner: 10Alex Monk) [20:21:22] (03Merged) 10jenkins-bot: Update server variables for ee.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239206 (https://phabricator.wikimedia.org/T31919) (owner: 10Alex Monk) [20:21:27] godog: is 7A322AC6E84AFDD2 the correct key id? [20:21:36] :) yay, godog [20:21:43] cscott: it is! 
from modules/releases/files/pubring.gpg [20:21:58] sweet! [20:22:10] cscott: I'll put some more instructions on wikitech at least, it could use some love [20:22:22] https://wikitech.wikimedia.org/wiki/Releases.wikimedia.org [20:22:28] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/239206/ (duration: 00m 12s) [20:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:22:35] np, thanks for the reminder [20:22:48] 6operations, 5Patch-For-Review: Change distribution in releases.wikimedia.org to "sid" or "jessie" - https://phabricator.wikimedia.org/T111225#1650932 (10Dzahn) 5Open>3Resolved https://releases.wikimedia.org/debian/dists/jessie-mediawiki/ :) [20:23:03] yep that one [20:23:18] 6operations: Change distribution in releases.wikimedia.org to "sid" or "jessie" - https://phabricator.wikimedia.org/T111225#1650938 (10Dzahn) [20:24:14] godog, mutante, cscott: added basic info on the repo on that page [20:24:43] fetch instructions are "sudo apt-key advanced --keyserver keys.gnupg.net --recv-keys 7A322AC6E84AFDD2" [20:24:48] but that's not working yet [20:24:59] oh, wait, it just worked. yay! [20:24:59] cscott: mind adding that? [20:25:07] yeah, i'm updating all the docs. [20:25:42] cscott: awesome. godog, mutante, cscott: thank you all! [20:26:29] nice! thanks gwicke cscott mutante ! [20:26:43] PROBLEM - check_puppetrun on alnilam is CRITICAL: CRITICAL: Puppet has 19 failures [20:28:24] gwicke, do you know what needs to be done to make a restbase config change take effect? [20:28:32] gwicke: added "how to add new distro" [20:29:15] Krenair: a restbase deploy is needed for that [20:29:32] we happen to have one coming up within the next hour [20:29:47] :) nice, we added "ee.wikimedia" [20:29:57] yeah, I saw [20:30:07] thanks for keeping track of those renames! 
[20:30:13] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [20:30:44] hello csteipp , have a second to talk about: https://phabricator.wikimedia.org/T108843 [20:31:22] gwicke, the et.wikimedia removal will happen at some point (likely when the apache redirect goes into place), but it's not a critical part of the process [20:31:35] (03CR) 10Dzahn: "no change on neon, contact groups identical (but did it even work before?)" [puppet] - 10https://gerrit.wikimedia.org/r/235065 (owner: 10John F. Lewis) [20:31:43] nuria: Sure [20:31:43] RECOVERY - check_puppetrun on alnilam is OK: OK: Puppet is currently enabled, last run 181 seconds ago with 0 failures [20:32:15] nuria: Move to wikimedia-tech to avoid bot spam? [20:32:35] csteipp: the idea is to see how hard it is for someone with access to the cluster to reconstruct a browsing session? [20:32:54] as in , visited pages [20:32:59] nuria: Sortof [20:33:15] csteipp: ok, what else? [20:33:29] Turns out we have a bunch of sites calling themselves "Wikipedia" which aren't Wikipedias... [20:33:33] shock, etc. [20:35:12] Actually, let me correct that [20:35:13] RECOVERY - check_puppetrun on lutetium is OK: OK: Puppet is currently enabled, last run 134 seconds ago with 0 failures [20:35:22] SiteMatrix thinks we have a bunch of non-wikipedia "Wikipedia" named sites [20:35:25] SiteMatrix lies [20:35:36] Krenair: kk [20:35:44] aude, https://gerrit.wikimedia.org/r/#/c/239209/1/lib/sitematrix.json [20:35:45] nuria: The issue this stems from is whether pageviews should be kept indefinitely. So the issue is, given a forever log of pageviews, and a correlation of some recent history and IP in one of our logs (webrequests, eventlogging, ops logs, etc), how likely can we tie a real identity to the pageview history from >90 days ago. [20:35:59] test.wikidata.org is not calling itself "Wikipedia", but sitematrix says it is...? 
[20:36:08] same for the wikimaniawikis [20:36:14] and wikidata.org itself [20:36:43] (03CR) 10Dzahn: "@neon:/etc/icinga# grep -A1 "admins,analytics" puppet_services.cfg" [puppet] - 10https://gerrit.wikimedia.org/r/235065 (owner: 10John F. Lewis) [20:37:12] hallo [20:37:56] there's a problem with new messages in our newly deployed feature [20:37:58] https://pt.wikipedia.org/wiki/Especial:ContentTranslation [20:37:59] csteipp: it seems quite possible but not due to the pageviews but rather cause pageviews+geographical info are stored together [20:38:11] (you'll have to log in and enable the ContentTranslation beta feature to see this) [20:38:24] nuria: So my specific example when talking with Joseph was, if we have pageviews for a very unique geographic area, and there's only one person in the geography, then someone with access to our IP's and can find one that geolocates there, they have the real id. I'd like to understand how common that is. [20:38:28] is it a known issue? [20:38:49] nuria: Yes pageviews + geographic is the worst. Very unique UserAgent would be the same issue too. [20:38:56] (iirc) [20:39:03] csteipp: understanding that IP when it comes to #G connections is pretty meaningless right? [20:39:35] (03CR) 10Ori.livneh: [C: 032] "This is awesome. Very well done." [debs/pybal] - 10https://gerrit.wikimedia.org/r/238152 (https://phabricator.wikimedia.org/T102394) (owner: 10Giuseppe Lavagetto) [20:39:36] csteipp: cause we are going to find tons of matches for regions in which mobile users share the same IP [20:39:46] csteipp: wouldn't edits have a much higher signal / noise for that purpose? [20:39:50] #G connections? [20:39:57] (03Merged) 10jenkins-bot: Add instrumentation [debs/pybal] - 10https://gerrit.wikimedia.org/r/238152 (https://phabricator.wikimedia.org/T102394) (owner: 10Giuseppe Lavagetto) [20:39:58] csteipp: sorry 3G [20:40:46] nuria: that's carrier dependent. Some give IP per device (especially ones that are ipv6), others nat. 
[20:40:46] it's much easier for an individual to impact edits of an article than it is to move views much [20:41:24] gwicke: I don't follow [20:41:26] and GET vs. POST can be inferred from the data transfer, even with HTTPS [20:41:59] gwicke: This isn't about traffic analysis, this is stored data at the WMF [20:42:17] csteipp: we already publish the edit history, and I think those are more identifying than views [20:42:55] gwicke: The issue is us revealing a person's reading history [20:43:13] csteipp: that is less common than global IPs though on my experience, what I wanted to point out is that in mobile IP most of the time would not imply an individual , so: "will we find tons of records on the same city (even a small one) with the same IP? " Yes, we shall. [20:43:19] aharoni, I think Nikerabbit was talking to twentyafterfour about it earlier [20:43:19] real id => reading history, not the reverse [20:43:33] Krenair: probably [20:43:45] (03CR) 10Dzahn: [C: 031] Adding comment on disabling anon page creation on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/237960 (owner: 10Kaldari) [20:43:47] I'm wondering how can it be fixed ASAP [20:43:57] it's a brand new feature of which should be proud, but instead it's broken [20:44:12] csteipp: i will spend some time doing queries but -unless I am missing something - the false positives will be "all IOS8 users on verizon on toronto" [20:44:22] nuria: Right. The question is do we have pageviews with very small (single) numbers of IP's that geolocate there. [20:44:36] !log krenair@tin Synchronized wmf-config/interwiki.cdb: Updating interwiki cache (duration: 00m 12s) [20:44:37] Not large numbers [20:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:45:01] csteipp: Ok, will update you with results. 
[20:45:08] (03CR) 10Dzahn: [C: 04-1] "yes, after the move please (if we need it)" [puppet] - 10https://gerrit.wikimedia.org/r/231973 (https://phabricator.wikimedia.org/T82576) (owner: 10Ori.livneh) [20:45:32] (03CR) 10Dzahn: "bump, @ottomata is it still -1 from you?" [puppet] - 10https://gerrit.wikimedia.org/r/197081 (https://phabricator.wikimedia.org/T83531) (owner: 10ArielGlenn) [20:45:45] nuria: Cool. So just to be clear, running queries where we have geographies with *small* numbers of unique IPs, right? [20:46:02] I think Joseph and I discussed >10 [20:46:23] Krenair: https://phabricator.wikimedia.org/T112964 FWIW [20:46:24] <10, I mean :) [20:46:38] csteipp: yes, is going to be two steps I think, but understood. [20:46:51] aharoni, ty [20:47:12] Krenair: with which projects should it be tagged? It's probably a deployment issue, not an extension issue. [20:47:21] 6operations, 6Phabricator, 6Project-Creators, 6Triagers: Broaden the group of users that can create projects in Phabricator - https://phabricator.wikimedia.org/T706#1651135 (10Qgil) @MarcoAurelio, you're in. Please follow https://www.mediawiki.org/wiki/Phabricator/Creating_and_renaming_projects [20:47:32] aharoni, Deployment-systems perhaps? I'd certainly CC twentyafterfour [20:48:02] 10Ops-Access-Requests, 6operations, 3Discovery-Wikidata-Query-Service-Sprint, 7Icinga, 5Patch-For-Review: Get smalyshev permissions to icinga enough to control monitoring for wdqs_eqiad group - https://phabricator.wikimedia.org/T111243#1651147 (10Dzahn) I merged https://gerrit.wikimedia.org/r/#/c/235065/...
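(editor's note) The check csteipp and nuria settle on above, finding geographies where pageviews come from fewer than 10 unique IPs, could be sketched roughly as below. This is only an illustration under assumptions: the (geography, ip) record shape, the function name small_ip_geographies, and the sample data are all hypothetical, and the real queries would presumably run against the cluster's own log tables rather than in-memory Python.

```python
from collections import defaultdict


def small_ip_geographies(records, threshold=10):
    """Map each geography with fewer than `threshold` distinct IPs to its IP count.

    `records` is an iterable of (geography, ip) pairs; this shape is an
    assumption made for illustration, not the actual log schema.
    """
    ips_by_geo = defaultdict(set)
    for geo, ip in records:
        # A set keeps each IP once, so repeat pageviews do not inflate the count.
        ips_by_geo[geo].add(ip)
    return {
        geo: len(ips)
        for geo, ips in ips_by_geo.items()
        if len(ips) < threshold
    }


# Hypothetical sample: one place with a single IP, one with twelve.
sample = [("Smallville", "198.51.100.7"), ("Smallville", "198.51.100.7")]
sample += [("Bigtown", "203.0.113.%d" % i) for i in range(12)]

risky = small_ip_geographies(sample)
print(risky)
```

With the default threshold of 10, only the single-IP geography is flagged; shared mobile (NAT) addresses, as nuria points out, would need separate handling because one IP there does not imply one person.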
[20:48:14] csteipp: I wonder how effective traffic analysis is these days to identify pages accessed, with spdy / http2 reducing the amount of information available [20:49:03] gwicke: I predict it's still easy, but would be interesting to find out, sure [20:49:11] it's definitely gotten better, but for a small project I wouldn't be surprised if it was fairly straightforward to identify page names based on the traffic fingerprint [20:50:00] especially for large pages [20:51:02] (03PS4) 10Dzahn: squid: logrotate for webproxy on carbon [puppet] - 10https://gerrit.wikimedia.org/r/239009 (https://phabricator.wikimedia.org/T97119) [20:51:10] (03PS1) 10Alex Monk: Redirect et.wikimedia.org to ee.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/239278 (https://phabricator.wikimedia.org/T31919) [20:51:18] (03CR) 10jenkins-bot: [V: 04-1] Redirect et.wikimedia.org to ee.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/239278 (https://phabricator.wikimedia.org/T31919) (owner: 10Alex Monk) [20:51:54] (03PS2) 10Alex Monk: Redirect et.wikimedia.org to ee.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/239278 (https://phabricator.wikimedia.org/T31919) [20:53:20] (03CR) 10Dduvall: [C: 04-1] "This makes a lot more sense now that I'm seeing it applied and seems like a great overall improvement." (034 comments) [tools/scap] - 10https://gerrit.wikimedia.org/r/239028 (owner: 1020after4) [20:58:47] (03PS1) 10Yurik: Added wikivoyage-ev.org to maps referrer check [puppet] - 10https://gerrit.wikimedia.org/r/239279 [20:59:14] bblack, could you take a look ^ [20:59:57] 6operations, 10Wikimedia-Mailing-lists: education-collab-owner@lists.wikimedia.org bounce notification - https://phabricator.wikimedia.org/T112912#1651237 (10eliza) Hello All, Not sure if I should send another Phabrictor ticket, but we just received this ticket from Nick: Hi! I'm getting a number of these me... 
[21:01:08] 6operations, 6Phabricator, 6Project-Creators, 6Triagers: Broaden the group of users that can create projects in Phabricator - https://phabricator.wikimedia.org/T706#1651240 (10MarcoAurelio) >>! In T706#1651135, @Qgil wrote: > @MarcoAurelio, you're in. Please follow https://www.mediawiki.org/wiki/Phabricato... [21:07:08] (03CR) 1020after4: "The only problem with just using the context manager without the global function is that you wouldn't be able to get a logger without crea" (034 comments) [tools/scap] - 10https://gerrit.wikimedia.org/r/239028 (owner: 1020after4) [21:08:50] mutante, thanks for getting those puppet patches through. only remaining item should be https://gerrit.wikimedia.org/r/#/c/239278/ [21:09:02] then I think we can close this [21:13:14] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=27.83 Read Requests/Sec=730.15 Write Requests/Sec=9.64 KBytes Read/Sec=76025.58 KBytes_Written/Sec=292.79 [21:16:44] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=1.10 Read Requests/Sec=8.81 Write Requests/Sec=7.71 KBytes Read/Sec=35.24 KBytes_Written/Sec=71.67 [21:20:01] (03CR) 10Dduvall: A context manager for managing nested loggers (031 comment) [tools/scap] - 10https://gerrit.wikimedia.org/r/239028 (owner: 1020after4) [21:20:35] jynus, hi. I think https://www.mediawiki.org/wiki/Development_policy?type=revision&diff=1886519&oldid=1884610 is what you meant to say, can you review? [21:21:17] (03CR) 1020after4: "I'm going to refactor this to address the concerns Dan mentioned (above) and the discussion I had with Tyler just now via google Hangouts." 
[tools/scap] - 10https://gerrit.wikimedia.org/r/239028 (owner: 1020after4) [21:22:34] !log ori@tin Synchronized php-1.26wmf23/includes/resourceloader/ResourceLoaderModule.php: I952068d2d: Use MD4 to compute file hash rather than SHA1 (duration: 00m 12s) [21:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:22:48] !log ori@tin Synchronized php-1.26wmf22/includes/resourceloader/ResourceLoaderModule.php: I952068d2d: Use MD4 to compute file hash rather than SHA1 (duration: 00m 13s) [21:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:28:42] (03PS35) 10Ori.livneh: Basic role for Sentry [puppet] - 10https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles) [21:29:49] (03CR) 10jenkins-bot: [V: 04-1] Basic role for Sentry [puppet] - 10https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles) [21:32:54] (03PS36) 10Ori.livneh: Basic role for Sentry [puppet] - 10https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles) [21:34:01] 6operations, 7Database: Better mysql monitoring for number of connections and processlist strange patterns - https://phabricator.wikimedia.org/T112473#1651444 (10csteipp) @jcrespo, it seems like adding /proc/loadavg would be a useful metric to adjust on. But it might be a naive assumption that load is a good p... [21:36:06] 6operations, 10Wikimedia-Mailing-lists: education-collab-owner@lists.wikimedia.org bounce notification - https://phabricator.wikimedia.org/T112912#1651467 (10Pine) We're having this same problem on the Cascadia mailing list. 
[21:39:35] 6operations, 10Wikimedia-Mailing-lists: @txt.att.net bounce notifications being sent to list admins - https://phabricator.wikimedia.org/T112912#1651486 (10Krenair) [21:39:45] 6operations, 10Wikimedia-Mailing-lists: @txt.att.net bounce notifications being sent to list admins - https://phabricator.wikimedia.org/T112912#1651489 (10siebrand) Same for mediawiki-commits. [21:40:18] (03PS3) 10Yuvipanda: RESTBase: Update config [puppet] - 10https://gerrit.wikimedia.org/r/239133 (owner: 10Mobrovac) [21:40:26] (03CR) 10Yuvipanda: [C: 032 V: 032] RESTBase: Update config [puppet] - 10https://gerrit.wikimedia.org/r/239133 (owner: 10Mobrovac) [21:40:44] 6operations, 10Wikimedia-Mailing-lists: @txt.att.net bounce notifications being sent to list admins - https://phabricator.wikimedia.org/T112912#1651500 (10Barras) Same for checkuser-l, overight-wp-simple-l, cvn-l, publicpolicy and probably any other list I administer (I already deleted most mails) [21:41:12] yuvipanda: yupiii [21:41:16] yuvipanda: cheers [21:42:25] PROBLEM - puppet last run on xenon is CRITICAL: CRITICAL: Puppet last ran 23 hours ago [21:42:36] 6operations, 10Wikimedia-Mailing-lists: @txt.att.net bounce notifications being sent to list admins - https://phabricator.wikimedia.org/T112912#1651503 (10Krenair) Okay, I think it's clear now that this is happening on plenty of lists. :) [21:42:36] ignore ^^ [21:42:53] (the puppet run icinga, not the ticket :P ) [21:44:14] RECOVERY - puppet last run on xenon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:44:46] 6operations, 10Wikimedia-Mailing-lists: Maps-l: Disable or re-assign moderators - https://phabricator.wikimedia.org/T110962#1651519 (10Yurik) we started using the list again, and instantly lots of positive communication. Please reassign adminship to @maxsem, @tfinc, and I. Thanks! 
[21:47:35] 6operations, 10Wikimedia-Mailing-lists: @txt.att.net bounce notifications being sent to list admins - https://phabricator.wikimedia.org/T112912#1651538 (10Lydia_Pintscher) Also Wikidata. [21:50:43] PROBLEM - puppet last run on cerium is CRITICAL: CRITICAL: Puppet last ran 23 hours ago [21:51:33] PROBLEM - puppet last run on praseodymium is CRITICAL: CRITICAL: Puppet last ran 23 hours ago [21:52:24] RECOVERY - puppet last run on cerium is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [21:52:59] (03CR) 10Legoktm: [C: 04-1] "This will cause warnings because $wgQuickSurveysConfig will not be defined when you try and array_merge with it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239158 (https://phabricator.wikimedia.org/T112204) (owner: 10Jdlrobson) [21:53:14] RECOVERY - puppet last run on praseodymium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:53:56] (03CR) 10Jdlrobson: "Why will it not be defined?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239158 (https://phabricator.wikimedia.org/T112204) (owner: 10Jdlrobson) [21:54:37] (03CR) 10Jdlrobson: "(it's worth noting that previously $wgQuickSurveysConfig[] was working suggesting the array is defined already)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239158 (https://phabricator.wikimedia.org/T112204) (owner: 10Jdlrobson) [21:55:22] 6operations, 10Wikimedia-Mailing-lists: Maps-l: Disable or re-assign moderators - https://phabricator.wikimedia.org/T110962#1651579 (10Dzahn) Done. Added Max, Yurik and Tomasz. Moved Tim Bartel to moderator. //Maps-l list run by msemenik at wikimedia.org, yurik at wikimedia.org, tfinc at wikimedia.org// yo... 
[21:57:06] 6operations, 10Wikimedia-Mailing-lists: Maps-l: Disable or re-assign moderators - https://phabricator.wikimedia.org/T110962#1651595 (10Dzahn) 5Open>3Resolved [21:57:08] 6operations, 10Wikimedia-Mailing-lists: Evaluate lists with large moderation queues - https://phabricator.wikimedia.org/T110438#1651596 (10Dzahn) [21:58:28] (03PS5) 10Andrew Bogott: toolschecker: test for labsdb1004 [puppet] - 10https://gerrit.wikimedia.org/r/239183 (https://phabricator.wikimedia.org/T107449) [21:58:30] (03PS1) 10Andrew Bogott: Add read/write checks for toolsdb1001 1002 1003 [puppet] - 10https://gerrit.wikimedia.org/r/239288 [21:59:31] 6operations, 10ops-ulsfo: troubleshoot ulsfo side of IC-313592 - https://phabricator.wikimedia.org/T111101#1651601 (10RobH) 5Open>3Resolved This was resolved last week, and I neglected to resolve this task. There was an ongoing out of band support ticket with Carlos @ UnitedLayer. It turned out to be a b... [22:00:38] 6operations, 10Wikimedia-Mailing-lists: @txt.att.net bounce notifications being sent to list admins - https://phabricator.wikimedia.org/T112912#1651607 (10Anthere) Also fundraiser [22:01:11] (03CR) 10Andrew Bogott: [C: 032] Add read/write checks for toolsdb1001 1002 1003 [puppet] - 10https://gerrit.wikimedia.org/r/239288 (owner: 10Andrew Bogott) [22:04:05] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=627.62 Read Requests/Sec=274.12 Write Requests/Sec=0.50 KBytes Read/Sec=1096.48 KBytes_Written/Sec=2.01 [22:04:17] (03CR) 10Dzahn: [C: 032] squid: logrotate for webproxy on carbon [puppet] - 10https://gerrit.wikimedia.org/r/239009 (https://phabricator.wikimedia.org/T97119) (owner: 10Dzahn) [22:04:24] (03PS5) 10Dzahn: squid: logrotate for webproxy on carbon [puppet] - 10https://gerrit.wikimedia.org/r/239009 (https://phabricator.wikimedia.org/T97119) [22:05:53] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=0.60 Read Requests/Sec=0.00 Write Requests/Sec=0.60 
KBytes Read/Sec=0.00 KBytes_Written/Sec=2.40 [22:07:38] (03PS6) 10Andrew Bogott: toolschecker: test for labsdb1004 [puppet] - 10https://gerrit.wikimedia.org/r/239183 (https://phabricator.wikimedia.org/T107449) [22:07:40] (03PS1) 10Andrew Bogott: toolschecker: Fix function names for rw toolsdb checks [puppet] - 10https://gerrit.wikimedia.org/r/239290 [22:10:53] (03CR) 10Andrew Bogott: [C: 032] toolschecker: Fix function names for rw toolsdb checks [puppet] - 10https://gerrit.wikimedia.org/r/239290 (owner: 10Andrew Bogott) [22:11:33] PROBLEM - NFS read/writeable on labs instances on labstore1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string OK not found on http://tools-checker.wmflabs.org:80/nfs/home - 184 bytes in 0.017 second response time [22:11:41] 6operations, 10Wikimedia-Mailing-lists: @txt.att.net bounce notifications being sent to list admins - https://phabricator.wikimedia.org/T112912#1651660 (10Deskana) AND MY AXE! Wait, I mean, and my mailing lists. The Discovery Department's mailing lists are affected by this too. [22:11:42] (03PS6) 10Dzahn: squid: logrotate for webproxy on carbon [puppet] - 10https://gerrit.wikimedia.org/r/239009 (https://phabricator.wikimedia.org/T97119) [22:13:14] RECOVERY - NFS read/writeable on labs instances on labstore1002 is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.109 second response time [22:13:41] 6operations, 10Wikimedia-Mailing-lists: @txt.att.net bounce notifications being sent to list admins - https://phabricator.wikimedia.org/T112912#1651678 (10JohnLewis) p:5Triage>3Normal As Krenair said, enough lists is enough. This ticket is likely generating more spam than these bounces together. I'll take... [22:14:42] 6operations, 6Commons: delete gwtoolset job by Hansmuller - https://phabricator.wikimedia.org/T112878#1651693 (10Tgr) 5Open>3Resolved a:3Tgr Seems to be stopped for good. See T100972 for the long-term handling of such problems. 
[22:17:57] 6operations, 10Wikimedia-Mailing-lists: @txt.att.net bounce notifications being sent to list admins - https://phabricator.wikimedia.org/T112912#1651708 (10eliza) Thank you everyone ! [22:20:24] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.69% of data above the critical threshold [500.0] [22:26:45] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [22:29:13] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:29:57] Hey, it looks like the meetbot logs for June 10 are gone, link http://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-06-10-21.02.html in https://phabricator.wikimedia.org/T99268#1355187 is a 404 [22:30:10] sorry, seems like a labs issue [22:32:21] gwicke: there are already non-WMF users of jessie parsoid 0.4.1. on jessie now [22:33:01] mutante: yeah, at least cscott ;) [22:33:18] gwicke: < icinga-miraheze> RECOVERY - Parsoid on parsoid1 is OK: TCP OK miraheze.org [22:33:34] ah, nice [22:33:41] mutante: sure link the recovery :P [22:33:54] afaik, most parsoid users go with the deb [22:33:56] what else?:) that's the good part [22:34:13] we shall create one for RB too [22:34:14] the log I did saying specifically upgrading might have been more relevant [22:36:20] 6operations, 10RESTBase, 6Services: RESTBase logging broken in both production & staging - https://phabricator.wikimedia.org/T112985#1651771 (10mobrovac) [22:53:10] !log puppet on restbase cluster disabled since about 21:30 UTC for gradual deploy; ran into minor issue in staging, which is now being addressed, after which deploy will continue [22:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:00:04] RoanKattouw ostriches rmoen Krenair: Dear anthropoid, the time has come. 
Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150917T2300). [23:00:04] tgr: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:02:05] 6operations, 10RESTBase, 6Services: RESTBase logging broken in both production & staging - https://phabricator.wikimedia.org/T112985#1651851 (10GWicke) [23:02:48] (03CR) 10Mattflaschen: [C: 032] Set $wgFlowMigrateReferenceWiki false on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234207 (https://phabricator.wikimedia.org/T107204) (owner: 10Mattflaschen) [23:03:18] (03Merged) 10jenkins-bot: Set $wgFlowMigrateReferenceWiki false on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234207 (https://phabricator.wikimedia.org/T107204) (owner: 10Mattflaschen) [23:04:35] 6operations, 10RESTBase, 6Services: RESTBase logging broken in both production & staging - https://phabricator.wikimedia.org/T112985#1651859 (10mobrovac) > were there any changes in logstash around that time? [SAL](https://wikitech.wikimedia.org/wiki/Server_Admin_Log) says there were (seemingly unrelated) c... [23:04:52] !log mattflaschen@tin Synchronized wmf-config/CommonSettings-labs.php: Beta-only change (duration: 00m 12s) [23:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:05:28] !log mattflaschen@tin Synchronized wmf-config/CommonSettings-labs.php: Beta-only change (duration: 00m 12s) [23:10:44] PROBLEM - puppet last run on mw1255 is CRITICAL: CRITICAL: Puppet has 1 failures [23:11:24] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 8.33% of data above the critical threshold [500.0] [23:12:09] 6operations, 10Wikimedia-Mailing-lists: @txt.att.net bounce notifications being sent to list admins - https://phabricator.wikimedia.org/T112912#1651887 (10Platonides) I saw 12 of them in 24 hours for one of the lists I administer. 
All of them for the same number as in the above report, but note it was @**mms*... [23:14:09] 6operations, 10RESTBase, 6Services: RESTBase logging broken in both production & staging - https://phabricator.wikimedia.org/T112985#1651892 (10GWicke) I didn't find anything obviously related in the puppet git log either. Only candidates form SAL: - some elasticsearch changes on the 15th - lots of FERM f... [23:18:24] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:36:54] RECOVERY - puppet last run on mw1255 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [23:37:37] (03PS1) 10Dzahn: build 1.7.13 for trusty (T109947) [debs/adminbot] - 10https://gerrit.wikimedia.org/r/239304 [23:38:18] RoanKattouw ostriches rmoen Krenair: anyone doing SWAT today? [23:38:30] I can do self-service if everyone's busy [23:38:54] sure [23:38:59] (03CR) 10Alex Monk: [C: 032] Enable authmetrics logging everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238978 (https://phabricator.wikimedia.org/T91701) (owner: 10Gergő Tisza) [23:39:20] (03Merged) 10jenkins-bot: Enable authmetrics logging everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238978 (https://phabricator.wikimedia.org/T91701) (owner: 10Gergő Tisza) [23:39:54] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/238978/ (duration: 00m 12s) [23:39:56] tgr, ^ [23:39:59] thanks! 
[23:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:41:38] (03CR) 10Dzahn: [C: 032] Fix linting issues [debs/adminbot] - 10https://gerrit.wikimedia.org/r/237488 (owner: 10Hashar) [23:41:55] (03CR) 10Dzahn: [C: 032] Add tox and flake8 [debs/adminbot] - 10https://gerrit.wikimedia.org/r/237489 (owner: 10Hashar) [23:41:57] (03Merged) 10jenkins-bot: Add tox and flake8 [debs/adminbot] - 10https://gerrit.wikimedia.org/r/237489 (owner: 10Hashar) [23:46:35] (03PS2) 10Dzahn: build 1.7.13 for trusty (T109947) [debs/adminbot] - 10https://gerrit.wikimedia.org/r/239304 [23:47:03] (03CR) 10Dzahn: [C: 032] build 1.7.13 for trusty (T109947) [debs/adminbot] - 10https://gerrit.wikimedia.org/r/239304 (owner: 10Dzahn) [23:47:05] (03Merged) 10jenkins-bot: build 1.7.13 for trusty (T109947) [debs/adminbot] - 10https://gerrit.wikimedia.org/r/239304 (owner: 10Dzahn) [23:53:40] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1652011 (10Eevans) The rebuilds of codfw nodes is complete ``` Datacenter: codfw ================= Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -...