[00:00:57] jamesofur, I see: https://en.wikisource.org/wiki/Special:Contributions/Jamesofur :D [00:02:40] MaxSem: My editing history (on either of my accounts :P ) does not give a very good sign of my reading or diff usage :P [00:03:02] (03CR) 10Quiddity: "Where exactly do the "wmgMFRemovableClasses => extracts" get used currently?" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126226 (owner: 10Prtksxna) [00:05:18] I really should get back to my wikisource project, sometime... So many projects! >.< [00:19:59] (03PS14) 10BryanDavis: [WIP] Configure scap master and clients in beta [operations/puppet] - 10https://gerrit.wikimedia.org/r/123674 [00:26:12] mwalker & Reedy, https://gerrit.wikimedia.org/r/126880 [00:59:04] yurik: why wouldn't we? [00:59:44] PROBLEM - Puppet freshness on db1056 is CRITICAL: Last successful Puppet run was Wed 16 Apr 2014 06:54:47 AM UTC [01:42:25] !log xtrabackup db63 to db60 [01:42:35] Logged the message, Master [02:03:46] !log stop mysqld on db35 (m1) for decom [02:03:54] Logged the message, Master [02:07:02] (03PS1) 10Springle: Remove db35 from m1. [operations/puppet] - 10https://gerrit.wikimedia.org/r/126898 [02:08:53] (03CR) 10Springle: [C: 032] Remove db35 from m1. [operations/puppet] - 10https://gerrit.wikimedia.org/r/126898 (owner: 10Springle) [02:11:44] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 2911 MB (3% inode=99%): [02:18:44] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3450 MB (3% inode=99%): [02:22:43] * springle kicks neon [02:33:49] !log LocalisationUpdate completed (1.23wmf21) at 2014-04-17 02:33:47+00:00 [02:33:55] Logged the message, Master [03:00:44] RECOVERY - Disk space on virt0 is OK: DISK OK [03:02:27] !log LocalisationUpdate completed (1.23wmf22) at 2014-04-17 03:02:25+00:00 [03:02:33] Logged the message, Master [03:06:29] !log deployed Parsoid 0bccf02c (deploy SHA 5e25f3b05) @ 1:30 pm PST, Apr 16th, 2014 [03:06:35] Logged the message, Master [03:49:24] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Apr 17 03:49:19 UTC 2014 (duration 49m 18s) [03:49:29] Logged the message, Master [04:00:00] PROBLEM - Puppet freshness on db1056 is CRITICAL: Last successful Puppet run was Wed 16 Apr 2014 06:54:47 AM UTC [04:18:20] PROBLEM - Disk space on dataset1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:37:38] (03CR) 10Ori.livneh: [C: 032] Remove unneeded priority settings [operations/puppet] - 10https://gerrit.wikimedia.org/r/126828 (owner: 10Ori.livneh) [05:55:43] (03PS2) 10Nemo bis: Adding '*.panoramio.com' to the wgCopyUploadsDomains array [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126384 (owner: 10Marco) [06:00:35] (03PS3) 10Nemo bis: Adding '*.panoramio.com' to the wgCopyUploadsDomains array [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126384 (owner: 10Marco) [06:00:55] (03CR) 10Nemo bis: [C: 031] "Alright, did the bureaucracy for you" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126384 (owner: 10Marco) [06:41:07] (03CR) 10Dzahn: "since Change-Id: I6b9c47055e7 (also see Change-Id: I6d6250e69) we replaced virtual packages and i think this is one that pulls ttf-dejavu" [operations/puppet] - 10https://gerrit.wikimedia.org/r/126834 (owner: 10Brian Wolff) [06:44:04] (03PS3) 10Dzahn: Add ttf-dejavu to image scalers for "DejaVu (Sans|Serif) Condensed". 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/126834 (owner: 10Brian Wolff) [06:44:43] (03PS4) 10Dzahn: Add ttf-dejavu-core,ttf-dejavu-extra to image scalers for "DejaVu (Sans|Serif) Condensed". [operations/puppet] - 10https://gerrit.wikimedia.org/r/126834 (owner: 10Brian Wolff) [06:55:02] (03PS5) 10Giuseppe Lavagetto: Add ttf-dejavu-core,ttf-dejavu-extra to image scalers for "DejaVu (Sans|Serif) Condensed". [operations/puppet] - 10https://gerrit.wikimedia.org/r/126834 (owner: 10Brian Wolff) [06:56:05] (03CR) 10Giuseppe Lavagetto: [C: 032] Add ttf-dejavu-core,ttf-dejavu-extra to image scalers for "DejaVu (Sans|Serif) Condensed". [operations/puppet] - 10https://gerrit.wikimedia.org/r/126834 (owner: 10Brian Wolff) [07:01:00] PROBLEM - Puppet freshness on db1056 is CRITICAL: Last successful Puppet run was Wed 16 Apr 2014 06:54:47 AM UTC [07:09:36] (03CR) 10Dzahn: [C: 032] remove virt15 from DHCP, decom [operations/puppet] - 10https://gerrit.wikimedia.org/r/126250 (owner: 10Dzahn) [07:42:00] (03PS1) 10Dzahn: decom db35,db38, remove from dsh, dhcp [operations/puppet] - 10https://gerrit.wikimedia.org/r/126928 [07:43:31] (03CR) 10Dzahn: [C: 032] decom db35,db38, remove from dsh, dhcp [operations/puppet] - 10https://gerrit.wikimedia.org/r/126928 (owner: 10Dzahn) [07:45:14] !log restarting gitblit [07:45:20] Logged the message, Master [07:47:53] !log db35,db38, stop puppet and salt, revoke certs,keys [07:47:59] Logged the message, Master [08:02:46] (03PS1) 10Dzahn: remove lvs1-4 from dsh groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/126929 [08:04:50] (03CR) 10Dzahn: [C: 032] remove lvs1-4 from dsh groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/126929 (owner: 10Dzahn) [08:06:00] (03CR) 10Dzahn: [C: 032] rm wap.wikipedia.org apache site [operations/puppet] - 10https://gerrit.wikimedia.org/r/126227 (owner: 10Dzahn) [08:26:57] (03CR) 10Odder: [C: 031] Adding '*.panoramio.com' to the wgCopyUploadsDomains array [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126384 (owner: 10Marco) [08:28:54] !log db35,db38 - shutdown [08:29:00] Logged the message, Master [08:32:40] (03PS1) 10Dzahn: add wiktionary.eu, link to wiktionary.org [operations/dns] - 10https://gerrit.wikimedia.org/r/126932 [08:33:00] twkozlowski: ^ [08:33:31] \o/ [08:33:42] (03CR) 10Dzahn: "please check status on #7304" [operations/dns] - 10https://gerrit.wikimedia.org/r/126932 (owner: 10Dzahn) [08:47:41] PROBLEM - Disk space on labstore1001 is CRITICAL: DISK CRITICAL - free space: /exp/dumps 379584 MB (3% inode=99%): [08:56:08] (03PS1) 10Dzahn: remove sumanah's LDAP admin permissions [operations/puppet] - 10https://gerrit.wikimedia.org/r/126935 [08:58:51] (03PS2) 10Dzahn: remove sumanah's LDAP admin permissions [operations/puppet] - 10https://gerrit.wikimedia.org/r/126935 [09:01:03] apergos: /ext/dumps on labstore1001 is nearly full apparently . I guess that is related to the wiki xml dumps ? :-) [09:02:03] (03CR) 10Dzahn: [C: 032] "with the key already being absent these can't be used anyways. and added a 'revoked' comment" [operations/puppet] - 10https://gerrit.wikimedia.org/r/126935 (owner: 10Dzahn) [09:04:12] ugh really? [09:04:21] how full is nearly full? [09:04:58] (3% inode=99%): [09:05:14] let me rephrase that, how much space is it using? 
[09:07:50] what we should do is deal with this: https://bugzilla.wikimedia.org/show_bug.cgi?id=48894 [09:09:06] dont know, just relaying the icinga notification :D [09:09:17] saying there is 'only' 380GB of disk space left [09:10:52] it just cares about percentages, the larget the disk the earlier the warning [09:13:03] well that amount of space will get eaten over time because we don't cap the number of pageview files, which is what that bug is about [09:13:04] (03PS1) 10Odder: Redirect wiktionary.eu to www.wiktionary.org [operations/apache-config] - 10https://gerrit.wikimedia.org/r/126937 [09:13:16] (03CR) 10jenkins-bot: [V: 04-1] Redirect wiktionary.eu to www.wiktionary.org [operations/apache-config] - 10https://gerrit.wikimedia.org/r/126937 (owner: 10Odder) [09:17:23] (03CR) 10Dzahn: "ok, so i just looked at bast1001 in site.pp for something else and i see this:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/116019 (owner: 10Hoo man) [09:26:25] (03CR) 10Alexandros Kosiaris: [C: 032] Sysctl: make the default priority 70 [operations/puppet] - 10https://gerrit.wikimedia.org/r/126839 (owner: 10Ori.livneh) [09:31:12] (03CR) 10Gilles: [C: 031] Add ttf-kochi-mincho and ttf-kochi-gothic to imagescalers [operations/puppet] - 10https://gerrit.wikimedia.org/r/126729 (owner: 10Reedy) [09:32:05] (03PS2) 10Odder: Redirect wiktionary.eu to www.wiktionary.org [operations/apache-config] - 10https://gerrit.wikimedia.org/r/126937 [09:36:41] RECOVERY - Disk space on labstore1001 is OK: DISK OK [09:39:05] (03PS1) 10Dzahn: remove admins::restricted from lucene role [operations/puppet] - 10https://gerrit.wikimedia.org/r/126939 [09:40:07] (03CR) 10jenkins-bot: [V: 04-1] remove admins::restricted from lucene role [operations/puppet] - 10https://gerrit.wikimedia.org/r/126939 (owner: 10Dzahn) [09:40:50] (03PS2) 10Dzahn: remove admins::restricted from lucene role [operations/puppet] - 10https://gerrit.wikimedia.org/r/126939 [09:43:43] (03CR) 10Dzahn: "this is what this removes" [operations/puppet] - 10https://gerrit.wikimedia.org/r/126014 (owner: 10Dzahn) [09:47:31] (03PS1) 10Dzahn: remove admins::restricted from terbium,fluorine [operations/puppet] - 10https://gerrit.wikimedia.org/r/126941 [09:49:24] (03CR) 10Dzahn: "also see: Change-Id: Iad35d5707dc" [operations/puppet] - 10https://gerrit.wikimedia.org/r/116019 (owner: 10Hoo man) [10:01:31] PROBLEM - Puppet freshness on db1056 is CRITICAL: Last successful Puppet run was Wed 16 Apr 2014 06:54:47 AM UTC [10:11:19] !log lvs1-6 - disable puppet,salt,revoke certs,keys [10:11:25] Logged the message, Master [10:15:55] niiiiice [10:15:57] !log re-deleting unaccepted salt keys for virt2,5-11 [10:16:05] Logged the message, Master [10:35:27] (03CR) 10Matanya: "what about hooft in esams?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/126941 (owner: 10Dzahn) [10:38:41] apergos: can you please comment on this one? ^ does hooft have private data? 
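A note on the percentage-based disk check discussed above: because the threshold is a percentage rather than an absolute amount, a large filesystem alerts while it still has plenty of room in absolute terms. A quick back-of-the-envelope sketch in Python using the figures from the labstore1001 alert (the actual Icinga check command is not shown in the log):

def percent_free(free_mb, total_mb):
    return 100.0 * free_mb / total_mb

# Figures from the icinga alert above: 379584 MB free at 3%.
free_mb = 379584
total_mb = free_mb / 0.03  # implies a filesystem of roughly 12.7 TB
print("total: %.1f TB, free: %.1f GB (%.0f%% free)" % (
    total_mb / 1e6, free_mb / 1e3, percent_free(free_mb, total_mb)))

So the "3% free" warning on this volume still corresponds to about 380 GB, which is the point being made in the conversation.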
[10:40:47] (03PS1) 10Hashar: contint: remove ruby-bundler outdated package [operations/puppet] - 10https://gerrit.wikimedia.org/r/126953 [10:42:00] !log lvs1, lvs2 shutdown [10:42:05] Logged the message, Master [10:42:24] (03CR) 10Matanya: [C: 031] remove sudo::appserver from bastions [operations/puppet] - 10https://gerrit.wikimedia.org/r/126014 (owner: 10Dzahn) [10:43:09] (03CR) 10Hashar: [C: 031 V: 032] "Found out that some Jenkins jobs was falling because of the old bundle version :-D" [operations/puppet] - 10https://gerrit.wikimedia.org/r/126953 (owner: 10Hashar) [10:43:29] not by design, and note that mortals and restricted already have access over there now [10:48:01] (03PS1) 10Dzahn: remove lvs1-6 [operations/dns] - 10https://gerrit.wikimedia.org/r/126954 [10:50:11] (03PS2) 10Dzahn: remove lvs1-6 lvs1-6.wikimedia.org [operations/dns] - 10https://gerrit.wikimedia.org/r/126954 [10:51:31] (03PS1) 10Faidon Liambotis: ganglia_new: add esams to Swift's sites [operations/puppet] - 10https://gerrit.wikimedia.org/r/126956 [10:52:02] (03CR) 10Faidon Liambotis: [C: 032] ganglia_new: add esams to Swift's sites [operations/puppet] - 10https://gerrit.wikimedia.org/r/126956 (owner: 10Faidon Liambotis) [10:53:43] wow there they go... bye bye lvses [10:55:34] :D [10:56:13] !log lvs3,lvs4,lvs5,lvs6 - shutdown [10:56:19] Logged the message, Master [10:56:40] PROBLEM - Host ms-fe.pmtpa.wmnet is DOWN: CRITICAL - Network Unreachable (10.2.1.27) [10:56:55] heh [10:57:17] uhm [10:58:11] ACKNOWLEDGEMENT - LVS HTTP IPv4 on ms-fe.pmtpa.wmnet is CRITICAL: Connection timed out daniel_zahn LVS have been shutdown, service IPs removed [10:58:25] had removed the other monitoring but not swift [10:58:50] heh [10:58:56] (just got the page) [11:01:22] (03PS1) 10Dzahn: remove ms-fe.pmtpa monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/126957 [11:01:45] sorry, that was just left because i once thought we'd keep swift longer [11:01:50] nope [11:01:54] kill it! [11:02:06] yep, ok [11:02:12] killing it with that change above [11:02:22] the service IP/monitoring I mean [11:02:26] yea [11:02:27] the servers we can keep for another week or so [11:02:36] when robh/chris go there [11:02:49] ok [11:03:29] (03CR) 10Dzahn: [C: 032] "this was the last "pmtpa" in here" [operations/puppet] - 10https://gerrit.wikimedia.org/r/126957 (owner: 10Dzahn) [11:04:20] (03PS5) 10Reedy: Remove further pmtpa remnants [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126673 [11:04:33] (03CR) 10Reedy: [C: 032] "Bye bye tampa, bye bye." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126673 (owner: 10Reedy) [11:04:40] (03Merged) 10jenkins-bot: Remove further pmtpa remnants [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126673 (owner: 10Reedy) [11:05:54] !log reedy synchronized wmf-config/ 'I290bd1ea628563646c02651041fa2cec4a320b56' [11:06:01] Logged the message, Master [11:18:56] (03PS2) 10Zfilipin: contint: remove ruby-bundler outdated package [operations/puppet] - 10https://gerrit.wikimedia.org/r/126953 (owner: 10Hashar) [11:24:44] (03PS3) 10Zfilipin: contint: remove ruby-bundler outdated package [operations/puppet] - 10https://gerrit.wikimedia.org/r/126953 (owner: 10Hashar) [11:25:53] (03CR) 10Zfilipin: [C: 031] contint: remove ruby-bundler outdated package [operations/puppet] - 10https://gerrit.wikimedia.org/r/126953 (owner: 10Hashar) [11:32:08] (03PS1) 10Faidon Liambotis: ganglia: add cluster Swift esams, remove Ceph esams [operations/puppet] - 10https://gerrit.wikimedia.org/r/126960 [11:33:09] (03PS2) 10Faidon Liambotis: ganglia: add group Swift esams, remove Ceph esams [operations/puppet] - 10https://gerrit.wikimedia.org/r/126960 [11:33:28] (03CR) 10Faidon Liambotis: [C: 032 V: 032] ganglia: add group Swift esams, remove Ceph esams [operations/puppet] - 10https://gerrit.wikimedia.org/r/126960 (owner: 10Faidon Liambotis) [11:46:54] (03PS1) 10Dzahn: add role ldap operations on silver [operations/puppet] - 10https://gerrit.wikimedia.org/r/126961 [11:57:23] (03CR) 10Dzahn: [C: 032] bugzilla,make Apache SSL CipherSuite configurable [operations/puppet] - 10https://gerrit.wikimedia.org/r/126204 (owner: 10Dzahn) [12:01:01] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:01:47] ignore the upcoming swift esams alerts [12:01:51] RECOVERY - RAID on searchidx1001 is OK: OK: optimal, 1 logical, 4 physical [12:03:12] PROBLEM - swift-account-replicator on ms-be3002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [12:03:21] PROBLEM - Swift HTTP backend on ms-fe3002 is CRITICAL: Connection refused [12:03:21] PROBLEM - swift-container-updater on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [12:03:21] PROBLEM - swift-account-auditor on ms-be3004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [12:03:22] PROBLEM - swift-account-server on ms-be3002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [12:03:22] PROBLEM - swift-account-replicator on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [12:03:22] PROBLEM - swift-container-server on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [12:03:22] PROBLEM - swift-object-auditor on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [12:03:22] PROBLEM - swift-object-replicator on ms-be3004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [12:03:23] PROBLEM - swift-object-updater on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [12:03:23] PROBLEM - swift-container-auditor on ms-be3002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args 
^/usr/bin/python /usr/bin/swift-container-auditor [12:03:24] PROBLEM - swift-object-auditor on ms-be3002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [12:03:24] PROBLEM - swift-object-server on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [12:03:25] PROBLEM - swift-account-reaper on ms-be3004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [12:03:25] PROBLEM - swift-container-updater on ms-be3002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [12:03:31] PROBLEM - swift-account-server on ms-be3004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [12:03:31] PROBLEM - swift-object-replicator on ms-be3002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [12:03:41] PROBLEM - swift-container-server on ms-be3004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [12:03:41] PROBLEM - swift-object-server on ms-be3002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [12:03:41] PROBLEM - swift-container-server on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [12:03:41] PROBLEM - swift-account-replicator on ms-be3004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [12:03:41] PROBLEM - swift-account-auditor on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [12:03:41] PROBLEM - swift-container-updater on ms-be3004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [12:03:41] PROBLEM - swift-object-updater on ms-be3002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [12:03:42] PROBLEM - swift-object-replicator on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [12:03:42] PROBLEM - swift-container-replicator on ms-be3002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [12:03:43] PROBLEM - swift-container-auditor on ms-be3004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:03:51] PROBLEM - swift-object-updater on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [12:03:51] PROBLEM - swift-object-auditor on ms-be3004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [12:03:51] PROBLEM - swift-container-auditor on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:03:51] PROBLEM - swift-object-replicator on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [12:03:51] PROBLEM - swift-container-replicator on ms-be3004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [12:03:52] PROBLEM - swift-account-auditor on ms-be3003 is CRITICAL: PROCS CRITICAL: 
0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [12:03:52] PROBLEM - swift-object-updater on ms-be3004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [12:03:53] PROBLEM - swift-account-replicator on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [12:04:01] PROBLEM - Swift HTTP frontend on ms-fe3002 is CRITICAL: Connection refused [12:04:01] PROBLEM - swift-container-replicator on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [12:04:01] PROBLEM - Swift HTTP frontend on ms-fe3001 is CRITICAL: Connection refused [12:04:01] PROBLEM - swift-account-server on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [12:04:01] PROBLEM - swift-container-updater on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [12:04:01] PROBLEM - swift-account-reaper on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [12:04:11] PROBLEM - Swift HTTP backend on ms-fe3001 is CRITICAL: Connection refused [12:04:11] PROBLEM - swift-account-reaper on ms-be3002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [12:04:11] PROBLEM - swift-account-reaper on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [12:04:11] PROBLEM - swift-account-server on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [12:04:11] PROBLEM - swift-object-server on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [12:04:11] PROBLEM - swift-object-server on ms-be3004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [12:04:12] PROBLEM - swift-account-auditor on ms-be3002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [12:04:12] PROBLEM - swift-container-replicator on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [12:04:13] PROBLEM - swift-container-server on ms-be3002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [12:04:13] PROBLEM - swift-container-auditor on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:04:14] PROBLEM - swift-object-auditor on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [12:05:30] heh [12:05:34] this is noisy [12:05:43] we should probably fix the checks at some point... 
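For context on the burst of swift PROBLEM and RECOVERY lines above: each of those is a process-count check that scans command lines for a regex and goes CRITICAL when zero matching processes are found, so a planned restart of the swift daemons produces exactly this kind of flood followed by recoveries a few minutes later. The production checks appear to be the standard Nagios check_procs plugin run over NRPE, judging from the output format; the Python below is only a minimal sketch of the idea, not the deployed command.

import os
import re
import sys

def count_procs(pattern):
    """Count processes whose /proc/<pid>/cmdline matches a regex."""
    rx = re.compile(pattern)
    count = 0
    for pid in filter(str.isdigit, os.listdir('/proc')):
        try:
            with open('/proc/%s/cmdline' % pid, 'rb') as f:
                cmdline = f.read().replace(b'\0', b' ').decode('utf-8', 'replace')
        except IOError:
            continue  # process exited while we were looking
        if rx.search(cmdline):
            count += 1
    return count

pattern = r'^/usr/bin/python /usr/bin/swift-object-server'
n = count_procs(pattern)
print('%s: %d processes with regex args %s' % ('OK' if n else 'CRITICAL', n, pattern))
sys.exit(0 if n else 2)  # Nagios convention: 0 = OK, 2 = CRITICAL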
[12:09:41] RECOVERY - swift-object-server on ms-be3002 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [12:09:41] RECOVERY - swift-container-server on ms-be3004 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [12:09:41] RECOVERY - swift-container-server on ms-be3003 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [12:09:41] RECOVERY - swift-account-replicator on ms-be3004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [12:09:41] RECOVERY - swift-object-updater on ms-be3002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [12:09:41] RECOVERY - swift-account-auditor on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [12:09:41] RECOVERY - swift-container-updater on ms-be3004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [12:09:42] RECOVERY - swift-container-replicator on ms-be3002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [12:09:42] RECOVERY - swift-object-replicator on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [12:09:43] RECOVERY - swift-container-auditor on ms-be3004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:09:44] (03CR) 10Dzahn: [C: 031] Remove mysql client from bastionhost [operations/puppet] - 10https://gerrit.wikimedia.org/r/126027 (owner: 10Hoo man) [12:09:51] RECOVERY - swift-object-auditor on ms-be3004 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [12:09:51] RECOVERY - swift-object-updater on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [12:09:51] RECOVERY - swift-container-auditor on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:09:51] RECOVERY - swift-account-auditor on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [12:09:51] RECOVERY - swift-container-replicator on ms-be3004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [12:09:52] RECOVERY - swift-object-replicator on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [12:09:52] RECOVERY - swift-object-updater on ms-be3004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [12:09:52] RECOVERY - swift-account-replicator on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [12:10:01] RECOVERY - Swift HTTP frontend on ms-fe3002 is OK: HTTP OK: HTTP/1.1 200 OK - 137 bytes in 0.196 second response time [12:10:01] RECOVERY - swift-container-replicator on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [12:10:01] RECOVERY - Swift HTTP frontend on ms-fe3001 is OK: HTTP OK: HTTP/1.1 200 OK - 137 bytes in 0.197 second response time [12:10:01] RECOVERY - swift-account-server on ms-be3003 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [12:10:01] RECOVERY - swift-container-updater on ms-be3003 is OK: PROCS OK: 1 process with regex args 
^/usr/bin/python /usr/bin/swift-container-updater [12:10:01] RECOVERY - swift-account-reaper on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [12:10:11] RECOVERY - Swift HTTP backend on ms-fe3001 is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 0.207 second response time [12:10:11] RECOVERY - swift-object-server on ms-be3003 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [12:10:11] RECOVERY - swift-account-reaper on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [12:10:11] RECOVERY - swift-account-server on ms-be3001 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [12:10:11] RECOVERY - swift-account-reaper on ms-be3002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [12:10:12] RECOVERY - swift-object-auditor on ms-be3001 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [12:10:12] RECOVERY - swift-container-replicator on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [12:10:13] RECOVERY - swift-container-auditor on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:10:13] RECOVERY - swift-container-server on ms-be3002 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [12:10:14] RECOVERY - swift-object-server on ms-be3004 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [12:10:14] RECOVERY - swift-account-auditor on ms-be3002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [12:10:15] RECOVERY - swift-account-replicator on ms-be3002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [12:10:21] RECOVERY - Swift HTTP backend on ms-fe3002 is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 0.209 second response time [12:10:21] RECOVERY - swift-container-updater on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [12:10:21] RECOVERY - swift-account-replicator on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [12:10:21] RECOVERY - swift-account-auditor on ms-be3004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [12:10:21] RECOVERY - swift-container-server on ms-be3001 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [12:10:21] RECOVERY - swift-object-updater on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [12:10:22] RECOVERY - swift-account-server on ms-be3002 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [12:10:22] RECOVERY - swift-object-replicator on ms-be3004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [12:10:23] RECOVERY - swift-object-auditor on ms-be3003 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [12:10:23] RECOVERY - swift-container-auditor on ms-be3002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:10:24] RECOVERY - swift-object-server on ms-be3001 is OK: PROCS OK: 101 processes with regex 
args ^/usr/bin/python /usr/bin/swift-object-server [12:10:24] RECOVERY - swift-object-auditor on ms-be3002 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [12:10:25] RECOVERY - swift-account-reaper on ms-be3004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [12:10:25] RECOVERY - swift-container-updater on ms-be3002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [12:10:31] RECOVERY - swift-account-server on ms-be3004 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [12:10:31] RECOVERY - swift-object-replicator on ms-be3002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [12:25:11] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=ms-fe3002.esams.wmnet&m=cpu_report&s=by+name&mc=2&g=network_report&c=Swift+esams [12:25:18] and this folks is called a saturated gbps [12:25:35] (03CR) 10Dzahn: "do you still want this? mind if i amend? i'd add the recurse attribute as suggested above. so it's just a single file resource but you can" [operations/puppet] - 10https://gerrit.wikimedia.org/r/76678 (owner: 10Tim Starling) [12:27:51] urk [12:29:01] no, that's okay [12:29:05] I'm copying files over to esams [12:29:11] the faster it goes, the better [12:29:22] ah:) [12:29:32] all images? [12:29:39] how long do you expect it takes [12:30:05] <_joe_> paravoid: pretty saturated indeed [12:30:48] <_joe_> paravoid: we are just creating the thumbnails in esams, right? [12:31:59] not thumbs, originals [12:32:22] just to have another copy while tampa is on the move [12:32:26] in case eqiad burns down or something [12:32:30] <_joe_> uh, so it's cross-site replicated, ok [12:32:39] <_joe_> sorry, didn't know :) [12:33:46] cool, reassuring to have another copy [12:34:05] it will take less than a week [12:34:14] probably closer to 5 days [12:34:16] <_joe_> writes still go to eqiad, right? [12:35:08] <_joe_> ok, I'll look those details up on wikitech and in puppet [12:35:30] <_joe_> (and mediawiki/config, I guess) [12:36:17] everything goes to eqiad, yes [12:36:22] the esams one is not going to be used by production [12:36:48] we just have a python script running on copper (a random misc box) that copies all the files [12:37:06] it's a oneoff, but I'll probably keep it running on a loop or something [12:37:54] (03PS1) 10Dzahn: add wikisource.pl, link to wikisource.org [operations/dns] - 10https://gerrit.wikimedia.org/r/126968 [12:38:05] <_joe_> ok, less fancy than I've anticipated, more simple :) [12:38:26] yeah it's crude [12:38:29] but will do for now [12:38:47] the longer term plan is to set up a proper georeplicated swift cluster across the two DCs [12:39:00] swift supports geoclusters nowadays, understands regions/hierarchies etc. [12:39:10] the two DCs = eqiad & the new DC, not esams [12:40:18] <_joe_> paravoid: for obvious latency reasons as well as bandwidth costs, right? [12:40:37] latency mostly [12:40:47] and legal reasons [12:41:20] (03CR) 10Dzahn: "ideally link the Apache change over here" [operations/dns] - 10https://gerrit.wikimedia.org/r/126968 (owner: 10Dzahn) [12:41:30] <_joe_> oh, honestly never thought of the implications of syncronizing free content across borders :) [12:41:59] we don't do production data outside of the US [12:42:03] just caches, backups etc. 
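Regarding the copy job described a few lines up, the one-off Python script on copper that mirrors all originals from the eqiad Swift cluster to esams: the actual script is not in the log, but a minimal sketch of that approach with python-swiftclient, using placeholder auth URLs and credentials, could look like the following. In practice it would also need resumability and some rate limiting, since as noted above it can saturate a gigabit link on its own.

# Sketch of a one-off cross-datacenter copy loop for Swift originals.
# Auth URLs, account and key are placeholders, not production values.
import swiftclient
from swiftclient.exceptions import ClientException

src = swiftclient.Connection(authurl='https://swift-eqiad.example.org/auth/v1.0',
                             user='mw:media', key='SECRET')
dst = swiftclient.Connection(authurl='https://swift-esams.example.org/auth/v1.0',
                             user='mw:media', key='SECRET')

_, containers = src.get_account(full_listing=True)
for container in containers:
    name = container['name']
    try:
        dst.put_container(name)
    except ClientException:
        pass  # most containers will already exist on later passes
    _, objects = src.get_container(name, full_listing=True)
    for obj in objects:
        try:
            dst.head_object(name, obj['name'])
            continue  # already copied on a previous pass
        except ClientException as e:
            if e.http_status != 404:
                raise
        headers, body = src.get_object(name, obj['name'])
        dst.put_object(name, obj['name'], contents=body,
                       content_type=headers.get('content-type'))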
[12:42:26] because otherwise some country may come and ask us to remove a file/article/whatever [12:42:34] or at least that's my understanding of it [12:42:51] <_joe_> paravoid: yes it makes perfect sense. [12:43:20] <_joe_> I mean, not to operate under two different legislations [12:43:55] ah, while talking about images, do you know what this was? [12:44:03] imagedump.pmtpa.wmnet [12:44:22] i don't expect you will make imagedump.eqiad.wmnet? [12:45:51] I have no clue what this is [12:45:58] (03PS1) 10Odder: Redirect wikisource.pl to pl.wikisource.org [operations/apache-config] - 10https://gerrit.wikimedia.org/r/126969 [12:46:04] it says "dump" so maybe apergos knows? [12:46:11] mutante: ^^ [12:47:11] (03CR) 10Odder: "See I9332650 for the Apache patchset." [operations/dns] - 10https://gerrit.wikimedia.org/r/126968 (owner: 10Dzahn) [12:47:38] nope, predates me [12:48:36] (03CR) 10Dzahn: [C: 04-2] "Dereckson, abandon? if not please get some more attention to it again, it's sitting in our queue forever otherwise i'm afraid" [operations/puppet] - 10https://gerrit.wikimedia.org/r/80760 (owner: 10Dereckson) [12:49:14] paravoid: apergos , then i'll just kill it, since it's pmtpa, thx [12:49:35] sounds great [12:51:28] twkozlowski: yes, cool [12:52:25] mutante: So I set DNS to ns1.wm.org and ns2.wm.org... when? [12:53:25] twkozlowski: in this case after the apache change and the dns change are live [12:53:34] since you have the working redirect.. right [12:53:39] OK, will keep an eye on it. [12:53:44] Yeah, I do. [12:53:49] nods [12:55:08] re [12:55:39] (03PS2) 10Dzahn: remove imagedump.pmtpa.wmnet [operations/dns] - 10https://gerrit.wikimedia.org/r/125949 [12:57:55] (03PS3) 10Dzahn: remove imagedump.pmtpa.wmnet [operations/dns] - 10https://gerrit.wikimedia.org/r/125949 [12:58:09] (03PS5) 10Manybubbles: Deploy experimental highlighter [operations/software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/123704 [12:58:51] (03CR) 10Manybubbles: "Now that If0df224b21fe589cc7dcdc7e3548d1b1693abb44 is in (and going to test sites) I'd like to get this deployed so we can try it out." [operations/software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/123704 (owner: 10Manybubbles) [13:00:33] (03CR) 10Manybubbles: [C: 031] "Chad, Andrew Otto, and I are the only folks I know who might go poking there. We only do that rarely but it is nice to be able to get in." [operations/puppet] - 10https://gerrit.wikimedia.org/r/126939 (owner: 10Dzahn) [13:00:58] (03CR) 10Dzahn: [C: 032] remove imagedump.pmtpa.wmnet [operations/dns] - 10https://gerrit.wikimedia.org/r/125949 (owner: 10Dzahn) [13:01:41] PROBLEM - Puppet freshness on db1056 is CRITICAL: Last successful Puppet run was Wed 16 Apr 2014 06:54:47 AM UTC [13:03:57] (03CR) 10Dzahn: "this would only stop them from doing that if they are not already in another admin class, like mortals, which they have if they are deploy" [operations/puppet] - 10https://gerrit.wikimedia.org/r/126939 (owner: 10Dzahn) [13:06:05] (03PS1) 10Dzahn: remove rendering.pmtpa,rendering.svc.pmtpa [operations/dns] - 10https://gerrit.wikimedia.org/r/126971 [13:08:02] (03CR) 10Manybubbles: "Chad and I are deployers and Andrew is ops so we shouldn't need a new group." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/126939 (owner: 10Dzahn) [13:16:07] (03PS1) 10Dzahn: decom, remove db35,db38 [operations/dns] - 10https://gerrit.wikimedia.org/r/126972 [13:18:41] (03CR) 10Matanya: [C: 031] add role ldap operations on silver [operations/puppet] - 10https://gerrit.wikimedia.org/r/126961 (owner: 10Dzahn) [13:18:57] mutante: can you please verify formey only has ldap left? [13:24:47] matanya_: see the whole history in 6134 [13:24:58] it should only be that, yes [13:25:27] matanya_: well, strictly, speaking.. there is something [13:25:44] which is? [13:25:55] role::deployment::test [13:26:05] webserver::php5 [13:26:06] shrug [13:26:19] i was referring to php5 [13:26:28] webserver, almost certain just leftover from being svn [13:26:40] i guess it serves /something/ [13:26:46] but if it is svn, it is ok [13:26:48] deployment::test, dunno, but it's "test" [13:26:59] that is ryan's toy [13:27:01] iirc [13:27:21] gerrit.wikimedia.org [13:27:41] is the apache site [13:33:01] (03CR) 10Dzahn: [C: 032] add role ldap operations on silver [operations/puppet] - 10https://gerrit.wikimedia.org/r/126961 (owner: 10Dzahn) [13:37:52] ^ ok, that seems to have worked [13:37:58] (03CR) 10Ottomata: "admins::bastion sounds fine to me!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/116019 (owner: 10Hoo man) [13:38:06] i can now use ldaplist on silver [13:38:13] apergos: [13:38:15] matanya: [13:38:37] (03CR) 10Ottomata: [C: 031] Deploy experimental highlighter [operations/software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/123704 (owner: 10Manybubbles) [13:38:49] so time to kill formey mutante [13:38:57] may i have the joy? :) [13:39:40] (03CR) 10Manybubbles: [C: 032 V: 032] "Good enough for me, I'll git-deploy this today and we'll start picking it up when we reinstall the nodes. I'll have to do quick rolling r" [operations/software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/123704 (owner: 10Manybubbles) [13:40:14] matanya: of making patches? sure:) [13:40:30] thanks, two minutes [13:40:45] i'm not gonna kill it right the second [13:40:49] no rush, but thanks [13:41:12] !log synced experimental highlighter to elasticsearch nodes - they'll pick it up on restart [13:41:18] Logged the message, Master [13:42:26] ottomata: morning! [13:46:45] mornin! [13:46:51] (03PS1) 10Matanya: formey: decom [operations/puppet] - 10https://gerrit.wikimedia.org/r/126976 [13:47:18] ottomata: I'm piggy backing a plugin deploy onto your server repartitions [13:47:24] ok! [13:47:33] do you need to restart all daemons? [13:48:31] which reminds me, i'm going to start moving shards off of 1016, s'ok? [13:49:18] ottomata: any news regarding emery? 
[13:49:47] ja, sqstat is off [13:50:00] i've made some commits to continue the decom, need to get erbium running on unicast udp2log stream [13:50:07] been busy with other stuff [13:50:13] but i think we can move forward with it [13:50:41] need to merge the unicast patch before [13:54:03] (03PS1) 10Matanya: formey:decom [operations/dns] - 10https://gerrit.wikimedia.org/r/126978 [14:02:35] !log reedy updated /a/common to {{Gerrit|I290bd1ea6}}: Remove further pmtpa remnants [14:02:40] Logged the message, Master [14:03:43] (03PS1) 10Reedy: Add symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126979 [14:03:46] (03PS1) 10Reedy: testwiki to 1.24wmf1 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126980 [14:03:47] (03PS1) 10Reedy: Wikipedias to 1.23wmf22 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126981 [14:03:50] (03PS1) 10Reedy: Rest of group0 to 1.24wmf1 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126982 [14:03:58] (03CR) 10Reedy: [C: 032] Add symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126979 (owner: 10Reedy) [14:04:10] (03Merged) 10jenkins-bot: Add symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126979 (owner: 10Reedy) [14:09:47] (03PS3) 10BBlack: Update Zero netmapper data from zero.wikimedia.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/126829 [14:14:59] (03PS1) 10Manybubbles: Elasticsearch site plugins [operations/software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/126986 [14:17:10] (03PS1) 10Bartosz Dziewoński: Remove $wmgUseMicroDesign [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126987 [14:20:44] manybubbles: you ok if I move shards off of 1016? [14:20:53] ottomata: go ahead! [14:21:06] k its going [14:24:31] (03PS1) 10Bartosz Dziewoński: Remove $wmgUsabilityEnforce [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126989 [14:42:39] (03PS1) 10Andrew Bogott: Added role::labs::lvm::biglogs [operations/puppet] - 10https://gerrit.wikimedia.org/r/126992 [14:43:03] Coren: ^ [14:43:41] (03CR) 10BryanDavis: [C: 031] Elasticsearch site plugins [operations/software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/126986 (owner: 10Manybubbles) [14:44:18] andrewbogott: The whole disk? [14:44:40] hm... [14:45:28] bd808: I'm around-ish for the SWAT but mobile. Hope that's OK. [14:48:33] James_F: Ok with me because I'm not doing the swat :) [14:49:39] (03CR) 10coren: [C: 031] "Might be better parametrizable, but that works." [operations/puppet] - 10https://gerrit.wikimedia.org/r/126992 (owner: 10Andrew Bogott) [14:49:44] (03CR) 10Hashar: "Random moods :-] That managed to get scap deployed on beta cluster but I have the feeling that much more work will need to be done for pro" (036 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/123674 (owner: 10BryanDavis) [14:51:41] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [14:52:21] James_F|Away, RoanKattouw_away: SWAT in about 10 minutes. Ready to test it to make sure it didn't break anything? [14:52:49] manybubbles: Unless you really want to, I'll take the SWAT this morning. [14:53:08] anomie: have fun! I'm happy to if you don't want to, though [14:53:41] manybubbles: I may as well. But then I may drop offline to concentrate on actual coding. 
[14:53:55] anomie: good man [14:55:21] PROBLEM - Varnish HTTP text-frontend on cp1053 is CRITICAL: Connection timed out [14:55:22] PROBLEM - Varnish HTTP text-frontend on cp1052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:55:31] PROBLEM - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:55:38] I'm getting 502 errors. [14:55:41] PROBLEM - Varnish HTTP text-frontend on cp1067 is CRITICAL: Connection timed out [14:55:41] PROBLEM - Varnish HTTP text-frontend on cp1068 is CRITICAL: HTTP CRITICAL - No data received from host [14:55:51] PROBLEM - Varnish HTTP text-frontend on cp1055 is CRITICAL: Connection timed out [14:56:18] <_joe_> and this has to do with the number of 5xx we were seeing [14:56:21] PROBLEM - Varnish HTTP text-frontend on cp1066 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:56:22] PROBLEM - LVS HTTP IPv4 on text-lb.eqiad.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host [14:56:31] PROBLEM - LVS HTTPS IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:57:06] (03PS2) 10Andrew Bogott: Added role::labs::lvm::biglogs [operations/puppet] - 10https://gerrit.wikimedia.org/r/126992 [14:57:21] RECOVERY - Varnish HTTP text-frontend on cp1052 is OK: HTTP OK: HTTP/1.1 200 OK - 263 bytes in 8.932 second response time [14:57:21] RECOVERY - LVS HTTPS IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 66493 bytes in 8.544 second response time [14:57:29] (03CR) 10Andrew Bogott: [C: 032] Added role::labs::lvm::biglogs [operations/puppet] - 10https://gerrit.wikimedia.org/r/126992 (owner: 10Andrew Bogott) [14:57:31] PROBLEM - Varnish HTTP text-frontend on cp1054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:57:57] * anomie checks that his agent forwarding to tin is still functioning, not like Tuesday [14:58:31] PROBLEM - LVS HTTP IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL - No data received from host [14:58:32] <_joe_> mh can't see what's wrong with those varnishes [14:59:41] RECOVERY - LVS HTTP IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 66373 bytes in 8.181 second response time [14:59:51] RECOVERY - Varnish HTTP text-frontend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 263 bytes in 9.242 second response time [15:00:21] RECOVERY - Varnish HTTP text-frontend on cp1066 is OK: HTTP OK: HTTP/1.1 200 OK - 262 bytes in 5.920 second response time [15:00:22] PROBLEM - LVS HTTPS IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 7.040 second response time [15:01:21] RECOVERY - LVS HTTPS IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 66493 bytes in 6.095 second response time [15:01:31] RECOVERY - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 66528 bytes in 6.197 second response time [15:01:51] RECOVERY - Varnish HTTP text-frontend on cp1068 is OK: HTTP OK: HTTP/1.1 200 OK - 263 bytes in 8.284 second response time [15:02:32] Chase here, nickserv is giving me a fit, lots of paging but site seems up? 
Not sure what to do about it [15:03:00] <_joe_> Guest67437: I'm looking into it, not clear to me what's going on [15:04:10] !log reinstalling elastic1016 [15:04:17] Logged the message, Master [15:04:31] RECOVERY - LVS HTTP IPv4 on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 66405 bytes in 6.150 second response time [15:04:35] PROBLEM - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:04:39] <_joe_> ok, varnish on cp1053 is simply resetting connections [15:04:51] PROBLEM - Varnish HTTP text-frontend on cp1068 is CRITICAL: Connection timed out [15:04:55] James_F|Away, RoanKattouw_away: ping [15:05:20] * anomie waits for James_F|Away or RoanKattouw_away to be able to do their SWAT deploy [15:05:21] PROBLEM - Varnish HTTP text-frontend on cp1066 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:06:21] RECOVERY - Varnish HTTP text-frontend on cp1066 is OK: HTTP OK: HTTP/1.1 200 OK - 262 bytes in 7.744 second response time [15:06:22] RECOVERY - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 66531 bytes in 5.020 second response time [15:07:01] PROBLEM - Host elastic1016 is DOWN: PING CRITICAL - Packet loss = 100% [15:07:20] 502s ? [15:07:21] PROBLEM - Varnish HTTP text-frontend on cp1052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:07:23] https://www.mediawiki.org/wiki/ looks to be down [15:07:26] chrismcmahon: LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL [15:07:28] anomie: let's hold on any deploys for a little bit til this is sorted out [15:07:36] anomie: Here. [15:07:38] anomie: ^^^^ Where I said I was around but mobile. [15:07:41] <_joe_> bd808: eqiad is having problems [15:07:54] James_F: Making sure you were still around. But we're holding on apergos now. [15:08:11] <_joe_> I'm trying to figure out which problems, but I'm still *very* new to the environment [15:08:38] _joe_, i'm not sure I know much either, but maybe I can help [15:08:42] let's see! [15:08:52] RECOVERY - Varnish HTTP text-frontend on cp1068 is OK: HTTP OK: HTTP/1.1 200 OK - 263 bytes in 9.282 second response time [15:09:05] hey [15:09:09] what's going on? 
[15:09:22] PROBLEM - Varnish HTTP text-frontend on cp1066 is CRITICAL: Connection timed out [15:09:24] eqiad varnish unhappiness I see [15:09:28] dunno, text-lb.eqiad is not happy [15:09:29] yeha [15:09:30] <_joe_> paravoid: varnishes in eqiad having troubles [15:09:31] PROBLEM - LVS HTTPS IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:35] PROBLEM - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:35] <_joe_> I cannot figure out why [15:10:21] RECOVERY - Varnish HTTP text-frontend on cp1052 is OK: HTTP OK: HTTP/1.1 200 OK - 263 bytes in 6.741 second response time [15:10:21] RECOVERY - Varnish HTTP text-frontend on cp1066 is OK: HTTP OK: HTTP/1.1 200 OK - 262 bytes in 5.634 second response time [15:10:31] RECOVERY - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 66536 bytes in 8.450 second response time [15:10:35] PROBLEM - LVS HTTP IPv4 on text-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [15:10:43] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Text%20caches%20eqiad&h=cp1055.eqiad.wmnet&r=hour&z=default&jr=&js=&st=1397746916&v=19402&m=frontend.n_sess&vl=N&ti=N%20struct%20sess&z=large [15:11:01] fundraising I think [15:11:12] <_joe_> paravoid: what is that? number of sessions? [15:11:31] RECOVERY - LVS HTTP IPv4 on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 66406 bytes in 4.383 second response time [15:11:35] <_joe_> the timing fits [15:11:52] PROBLEM - Varnish HTTP text-frontend on cp1068 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:11:52] yes, too many requests [15:11:57] /wiki/Special:HideBanners?duration=1209600&category=fundraising [15:12:11] RECOVERY - Host elastic1016 is UP: PING OK - Packet loss = 0%, RTA = 1.55 ms [15:12:22] <_joe_> mh, but I've seen on graphite, the number of requests dropped [15:12:28] <_joe_> maybe that stat is confusing [15:12:30] (03CR) 10Ottomata: [C: 031] Elasticsearch site plugins [operations/software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/126986 (owner: 10Manybubbles) [15:13:09] (03CR) 10Manybubbles: [C: 032 V: 032] "Two +1s = +2, right? Deploying this now." 
[operations/software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/126986 (owner: 10Manybubbles) [15:13:31] RECOVERY - LVS HTTPS IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 66494 bytes in 9.817 second response time [15:14:12] PROBLEM - ElasticSearch health check on elastic1016 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.13 [15:14:21] PROBLEM - puppet disabled on elastic1016 is CRITICAL: Connection refused by host [15:14:21] PROBLEM - check if dhclient is running on elastic1016 is CRITICAL: Connection refused by host [15:14:21] RECOVERY - Varnish HTTP text-frontend on cp1054 is OK: HTTP OK: HTTP/1.1 200 OK - 263 bytes in 3.497 second response time [15:14:21] PROBLEM - Varnish HTTP text-frontend on cp1065 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:14:21] PROBLEM - LVS HTTP IPv4 on text-lb.eqiad.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host [15:14:41] PROBLEM - Disk space on elastic1016 is CRITICAL: Connection refused by host [15:14:41] PROBLEM - DPKG on elastic1016 is CRITICAL: Connection refused by host [15:14:51] PROBLEM - RAID on elastic1016 is CRITICAL: Connection refused by host [15:14:51] PROBLEM - SSH on elastic1016 is CRITICAL: Connection refused [15:15:01] PROBLEM - check configured eth on elastic1016 is CRITICAL: Connection refused by host [15:15:21] RECOVERY - Varnish HTTP text-frontend on cp1065 is OK: HTTP OK: HTTP/1.1 200 OK - 263 bytes in 7.418 second response time [15:16:21] PROBLEM - Varnish HTTP text-frontend on cp1066 is CRITICAL: HTTP CRITICAL - No data received from host [15:16:52] RECOVERY - Varnish HTTP text-frontend on cp1068 is OK: HTTP OK: HTTP/1.1 200 OK - 263 bytes in 9.724 second response time [15:17:37] !log updgraded site plugins on Elasticsearch nodes [15:17:43] Logged the message, Master [15:17:43] bd808: ^^^ whatson is pretty damn pretty [15:18:11] Yeah? I hadn't heard of it before. I'll have to check it out [15:18:21] RECOVERY - Varnish HTTP text-frontend on cp1066 is OK: HTTP OK: HTTP/1.1 200 OK - 262 bytes in 9.524 second response time [15:18:51] PROBLEM - Varnish HTTP text-frontend on cp1055 is CRITICAL: Connection timed out [15:19:31] PROBLEM - LVS HTTPS IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:19:35] RECOVERY - Varnish HTTP text-frontend on cp1067 is OK: HTTP OK: HTTP/1.1 200 OK - 261 bytes in 0.096 second response time [15:20:21] PROBLEM - Varnish HTTP text-frontend on cp1065 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:20:31] RECOVERY - LVS HTTPS IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 66493 bytes in 8.269 second response time [15:20:43] Do we need to worry about the email to SMS thing? [15:21:05] Is this going to be fixed soon? :/ [15:21:11] RECOVERY - Varnish HTTP text-frontend on cp1065 is OK: HTTP OK: HTTP/1.1 200 OK - 263 bytes in 0.403 second response time [15:21:16] pirsquared: being investigated. 
[15:21:21] https://web.archive.org/web/20140417151929/https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color(cactiStyle(alias(reqstats.500,%22500%20resp/min%22)),%22red%22)&target=color(cactiStyle(alias(reqstats.5xx,%225xx%20resp/min%22)),%22blue%22) [15:21:22] <_joe_> pirsquared: we're working on it [15:21:27] ok [15:21:31] RECOVERY - LVS HTTP IPv4 on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 66404 bytes in 5.419 second response time [15:21:50] ok [15:21:52] it's getting better [15:22:05] FR turned off banners [15:22:11] RECOVERY - Varnish HTTP text-frontend on cp1053 is OK: HTTP OK: HTTP/1.1 200 OK - 263 bytes in 0.109 second response time [15:22:12] so this is a a single varnish that went crazy? [15:22:16] <_joe_> paravoid: that was the reason, then [15:22:20] it's /probably/ getting better, let's wait and see [15:22:29] still tons of banners though [15:22:31] <_joe_> chasemp: like the whole text cluster in eqiad [15:22:41] RECOVERY - Varnish HTTP text-frontend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 263 bytes in 0.003 second response time [15:22:43] ok that makes more sense and ouch [15:23:14] <_joe_> paravoid: ~ 6K RxURL for bannners on all the varnishes I'm controlling now [15:23:31] PROBLEM - Varnish HTTP text-frontend on cp1054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:23:31] <_joe_> it was 10K some minutes ago [15:23:36] paravoid, _joe_: Please ping me when I'm clear to do the scheduled SWAT deploy (it's some VE/Math stuff). Thanks (: [15:25:21] RECOVERY - Varnish HTTP text-frontend on cp1054 is OK: HTTP OK: HTTP/1.1 200 OK - 263 bytes in 0.001 second response time [15:26:08] <_joe_> anomie: will do, either us or someone else [15:26:09] (03PS1) 10Faidon Liambotis: Revert "Enable CentralNotice CrossWiki Hiding" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126998 [15:26:32] (03CR) 10Faidon Liambotis: [C: 032] Revert "Enable CentralNotice CrossWiki Hiding" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126998 (owner: 10Faidon Liambotis) [15:26:51] PROBLEM - NTP on elastic1016 is CRITICAL: NTP CRITICAL: No response from NTP server [15:27:04] (03Merged) 10jenkins-bot: Revert "Enable CentralNotice CrossWiki Hiding" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126998 (owner: 10Faidon Liambotis) [15:27:28] !log faidon updated /a/common to {{Gerrit|If74ba5a52}}: Revert "Enable CentralNotice CrossWiki Hiding" [15:27:34] Logged the message, Master [15:27:51] RECOVERY - SSH on elastic1016 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.3 (protocol 2.0) [15:28:02] !log faidon synchronized wmf-config/CommonSettings.php 'disable CN CrossWiki Hiding again' [15:28:06] Logged the message, Master [15:34:34] ok, the HideBanners fix helped [15:35:01] <_joe_> paravoid: every metric is recovering as far as I can see [15:35:34] I think cause of the outage was the crosswiki hiding re-revert + fundraising running banners [15:37:49] paravoid: can you elaborate on what the crosswiki hiding re-revert is about? 
[15:38:34] (I'll wait for email if you're going to write it up anyway) [15:38:48] it's a CentralNotice feature that hides banners across all domains when you click the little "X" [15:39:04] so it basically does multiple requests for a 0-length image across all top-level domains [15:39:13] that had been enabled in the past, caused issues, I reverted it [15:39:23] matt thought he fixed it and reenabled it two days ago [15:39:47] so those requests enmass hammered varnish into submission? [15:40:14] (03PS2) 10Reedy: testwiki to 1.24wmf1 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126980 [15:40:18] so enabling banners, plus hiding the banners with the cross-wiki toggle, makes clients fetch that image 0-length? [15:40:19] my guess is that these requests + fundraising running banners today + organic traffic growth due to the US waking up [15:40:19] (03CR) 10Reedy: [C: 032] testwiki to 1.24wmf1 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126980 (owner: 10Reedy) [15:40:27] (03Merged) 10jenkins-bot: testwiki to 1.24wmf1 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126980 (owner: 10Reedy) [15:40:38] !log reedy Started scap: testwiki to 1.24wmf1 and build l10n cache [15:40:39] this showed banners to lots of users, and several of them clicked to hide them [15:40:44] Logged the message, Master [15:40:45] times 12 requests for each of them [15:40:57] <_joe_> paravoid: oh crap [15:41:02] ouch [15:41:39] https://gerrit.wikimedia.org/r/#/c/96641/ https://gerrit.wikimedia.org/r/#/c/96643/ https://gerrit.wikimedia.org/r/#/c/126065/ https://gerrit.wikimedia.org/r/126998 [15:42:02] <_joe_> anomie: I think you are good to go now; paravoid do you agree? [15:42:10] James_F: Still around? [15:42:15] anomie: Yup. [15:42:28] yes, go ehad [15:42:31] *ahead [15:42:39] and please don't break the site :) [15:42:44] * anomie is starting the SWAT deploy process [15:43:01] paravoid: If the site breaks, blame James_F and RoanKattouw_away. It's their code ;) [15:43:12] :) [15:43:13] * James_F grins. [15:43:25] Our code already running without issues elsewhere, but yes. :-) [15:43:52] paravoid: is this feature supposed to work by the client caching the 0-length images? [15:44:04] no [15:44:21] it's supposed to work by getting a Set-Cookie back [15:45:11] but because you can't set a cookie for *.org, the code does multiple requests, once per project [15:46:02] one request per domain that you see there: https://gerrit.wikimedia.org/r/#/c/126065/1/wmf-config/CommonSettings.php [15:46:56] <_joe_> paravoid: and those requests are non-cacheable, right? [15:47:03] they should be [15:47:21] they weren't initially, mwalker fixed that [15:47:24] <_joe_> still, a 12x boost in requests is *impossible* to sustain [15:47:37] !log anomie synchronized php-1.23wmf22/extensions/Math 'SWAT: 126913 - backport to wmf22 of critical fixes for the Math extension's VisualEditor tool' [15:47:43] Logged the message, Master [15:47:51] !log anomie synchronized php-1.23wmf22/extensions/VisualEditor 'SWAT: 126913 - backport to wmf22 of critical fixes for the Math extension's VisualEditor tool' [15:47:58] Logged the message, Master [15:48:00] James_F: ^ There you go [15:48:20] * anomie is done with the SWAT deploy [15:49:12] RECOVERY - ElasticSearch health check on elastic1016 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5627: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [15:49:27] anomie: Thanks! [15:50:10] so, interesting [15:50:14] http://en.wikipedia.org/wiki/Special:HideBanners?duration=1209600&category=fundraising is properly cached [15:50:25] http://commons.wikimedia.org/wiki/Special:HideBanners?duration=1209600&category=fundraising does not even respond [15:50:28] it times out [15:54:21] RECOVERY - puppet disabled on elastic1016 is OK: OK [15:54:21] RECOVERY - check if dhclient is running on elastic1016 is OK: PROCS OK: 0 processes with command name dhclient [15:54:41] RECOVERY - Disk space on elastic1016 is OK: DISK OK [15:54:41] RECOVERY - DPKG on elastic1016 is OK: All packages OK [15:54:51] RECOVERY - RAID on elastic1016 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [15:55:01] RECOVERY - check configured eth on elastic1016 is OK: NRPE: Unable to read output [15:55:24] ok this is crazy [15:55:48] akosiaris: is your work with OpenStreetMaps related to the 'maps' project in labs, or are those two different things? [15:56:02] paravoid: bad redirect at that commons URL? [15:56:15] bad redirect? [15:56:25] i.e. why does it time out [15:56:35] varnish isn't very happy that it's so busy I think [16:00:21] ah dammit [16:00:21] I know [16:00:31] damn [16:02:41] PROBLEM - Puppet freshness on db1056 is CRITICAL: Last successful Puppet run was Wed 16 Apr 2014 06:54:47 AM UTC [16:04:29] yeah pretty sure I found it [16:04:51] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 519410 bytes in 9.860 second response time [16:05:01] manybubbles: moving shards back to 1016 [16:05:08] bd808: ottomata yay [16:05:25] that's the last of the non master dance nodes [16:06:14] ottomata: yay [16:06:30] so, you wanna start on masters this afternoon later? [16:06:45] !log reedy Finished scap: testwiki to 1.24wmf1 and build l10n cache (duration: 26m 06s) [16:06:51] Logged the message, Master [16:07:41] RECOVERY - NTP on elastic1016 is OK: NTP OK: Offset -0.01075088978 secs [16:07:59] yeah sure manybubbles [16:08:01] paravoid: the suspense is killing us [16:08:11] haha [16:08:14] :) [16:08:26] working on some trebuchet stuff, once i get a commit in let's start them [16:11:01] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:12:01] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 519342 bytes in 9.469 second response time [16:12:23] <_joe_> and on thursday gitblit goes donw [16:12:27] <_joe_> as usual [16:13:37] (03PS1) 10Faidon Liambotis: varnish: don't override Special:HideBanners' TTL [operations/puppet] - 10https://gerrit.wikimedia.org/r/127001 [16:13:38] _joe_: My money is still on the new branch creation across core+all extensions causing it [16:13:38] greg-g: there you go [16:14:19] <_joe_> bd808: I think you're right sir [16:14:44] <_joe_> bd808: nullpointers can be thrown due to that kind of thing [16:14:44] (03CR) 10Faidon Liambotis: [C: 032 V: 032] varnish: don't override Special:HideBanners' TTL [operations/puppet] - 10https://gerrit.wikimedia.org/r/127001 (owner: 10Faidon Liambotis) [16:16:03] paravoid: nice [16:17:30] _joe_: ^^ [16:18:23] <_joe_> paravoid: ouch [16:18:41] <_joe_> paravoid: so that was not cached. 
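The mechanism paravoid describes above can be pictured with a small, purely hypothetical sketch — the real client code is CentralNotice JavaScript, but the effect of one "hide" click is roughly one Special:HideBanners request per top-level domain, each existing only to hand back a Set-Cookie scoped to that domain, since a cookie cannot cover a bare TLD like .org. The domain list below is abbreviated from the wmf-config change linked earlier.

```python
# Hypothetical model of the behaviour described above -- the real code is CentralNotice
# JavaScript; this only shows why a single "hide" click fans out into ~a dozen requests.
# Each domain gets its own Special:HideBanners hit whose useful payload is the Set-Cookie
# header scoped to that domain; the body itself is a zero-length image.
import requests

DOMAINS = [  # abbreviated; the full list is in the wmf-config change linked above
    "en.wikipedia.org",
    "commons.wikimedia.org",
    "www.wikidata.org",
    "en.wikisource.org",
]

def hide_banners_everywhere(duration=1209600, category="fundraising"):
    for domain in DOMAINS:
        r = requests.get(
            "https://{}/wiki/Special:HideBanners".format(domain),
            params={"duration": duration, "category": category},
        )
        print(domain, r.status_code, r.headers.get("Set-Cookie", "")[:60])

# hide_banners_everywhere()
```

Multiplied by a fundraising campaign showing the close button to a lot of readers, that per-click fan-out is the roughly 12x request boost _joe_ measured on the text varnishes.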
[16:19:20] nope [16:19:56] and the object was super busy [16:20:02] and that object wasn't one, but 12 objects [16:20:24] <_joe_> how come we did not kill the mw* servers then [16:20:24] spread consistently over 8 caches, so each of the backends probably had one of each [16:20:30] varnish died first [16:20:42] varnish doesn't handle super busy objects all that well [16:21:01] <_joe_> I'd have expected the opposite... but I've never used varnish to this scale [16:21:09] ouch [16:21:23] <_joe_> ok I really gotta bail out, see you later guys [16:21:40] varnish was coalescing all these frontend requests [16:21:45] into single backend requests [16:21:48] so they all piled up [16:23:01] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:25:38] paravoid: does that mean all of the requests for this object effectively hits one machine hosting the canonical backend copy? [16:28:32] paravoid: also, now that you've fixed the TTL snafu can we let fr restore the banner campaign? [16:31:52] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 519575 bytes in 9.488 second response time [16:37:25] btw, do we have any plans to upgrade to varnish 4? [16:37:40] it looks pretty nice, and it looks like the requisite vcl migrations could be done via erb / puppet [16:38:01] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:38:14] though i thought mark mentioned something about the vmod experience being substantially improved w/v4 and didn't see that [16:41:01] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 518973 bytes in 9.598 second response time [16:47:18] (03PS5) 10Ottomata: Running update-server-info for submodules during deployment_server_init [operations/puppet] - 10https://gerrit.wikimedia.org/r/126846 [16:48:12] (03PS6) 10Ottomata: Running update-server-info for submoduleswq [operations/puppet] - 10https://gerrit.wikimedia.org/r/126846 [16:49:34] (03PS7) 10Ottomata: Running update-server-info for submoduleswq [operations/puppet] - 10https://gerrit.wikimedia.org/r/126846 [16:50:57] (03CR) 10jenkins-bot: [V: 04-1] Running update-server-info for submoduleswq [operations/puppet] - 10https://gerrit.wikimedia.org/r/126846 (owner: 10Ottomata) [16:56:02] manybubbles: ok i'm going to grab some lunch real quick [16:56:09] ottomata: later [16:56:15] ? [16:56:18] oh [16:56:19] shoudl I start moving shards off of 1002 [16:56:20] first? [16:56:23] since it takes a while? [16:56:26] oh, AND 1001? [16:56:43] go over plan with me again real quick [16:59:24] manybubbles: ^^ [16:59:37] ottomata: [16:59:38] sure [16:59:49] so the plan is to move shards off of both the nodes we're doing the dancy with [16:59:55] change the masterness in puppet [17:00:01] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:00:07] then rebuild one node [17:00:20] rebuild the non-master, rather [17:00:24] then rebuild the master [17:00:38] the trick is that the master will remain the master until we bounce it [17:00:44] like, as soon as the new master is up and ready, but even before shards are back on it? [17:01:11] ottomata: that is safe because we've already moved shards off the old master too [17:01:18] would we need to run puppet everywhere (and bounce elasticsearch) to get new master settings live? [17:01:22] or does es do that dyanmically? 
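Before the log moves on to the Elasticsearch master work, here is a toy model of the request coalescing just described — an illustration of the idea only, not Varnish internals: on a miss, one request per distinct URL is allowed through to the backend and everyone else asking for the same object waits on that fetch, which is exactly where a hot, effectively uncacheable object hurts.

```python
# Toy model of request coalescing -- an illustration of the concept, not Varnish code.
# Within a batch of simultaneous misses, only one request per distinct URL goes to the
# backend; the rest are parked on ("coalesced onto") that in-flight fetch.
import collections

cache = {}

def serve_batch(requests_in_flight):
    fetches = {}                            # url -> the one client allowed through
    parked = collections.defaultdict(list)  # url -> clients waiting on that fetch
    for client, url in requests_in_flight:
        if url in cache:
            continue                        # plain hit, nobody waits
        if url in fetches:
            parked[url].append(client)      # coalesced onto the existing backend fetch
        else:
            fetches[url] = client
    return fetches, parked

batch = [("client%d" % i, "/wiki/Special:HideBanners?category=fundraising")
         for i in range(1000)]
fetches, parked = serve_batch(batch)
print(len(fetches), "backend fetch,", sum(len(v) for v in parked.values()), "clients parked")
# If that one response turns out to be uncacheable, the parked clients cannot be served
# from cache and end up being handled one after another -- the pile-up described above.
```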
[17:01:58] ottomata: the only machines that need the new master settings are the ones we're bouncing any way [17:02:11] just elastic1001 and elsatic1002 in this case [17:02:19] so for a real example: [17:02:31] start moving shards off of both of those nodes [17:02:44] (now should be fine) [17:03:01] then make a puppet change that takes master-eligibility from elastic1001 and gives it to 1002 [17:03:07] then rebuild 1002 [17:03:12] don't bounce 1001 [17:03:19] ok great, shoudl we wait til 1016 has full shards back? [17:03:22] or can we start that now? [17:03:31] once 1002 is rebuilt, we can start moving shards back to 1002 and then start immediately on 1001 [17:03:36] we can start now [17:03:38] ok [17:03:45] doing that then: moving shards off of 1001 and 10012 [17:03:46] 1002 [17:03:56] its already got a bunch of shards [17:04:05] yeah [17:04:14] oh yeah it does [17:04:29] oh, i've never excluded two ips before [17:04:46] commas [17:04:48] ok [17:05:30] like this? [17:05:31] \"cluster.routing.allocation.exclude._ip\" : \"10.64.0.108\",\"10.64.0.109\" [17:05:59] manybubbles: ? [17:06:01] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 453634 bytes in 9.306 second response time [17:06:07] (03CR) 10BryanDavis: Running update-server-info for submoduleswq (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/126846 (owner: 10Ottomata) [17:06:24] like"1064.0.108,10.64.0.109" [17:06:49] oh, in the same quotes [17:06:50] ok [17:06:59] ok running that [17:07:18] "exclude" : { [17:07:18] "_ip" : "10.64.0.108,10.64.0.109" [17:07:18] cool [17:07:19] ? [17:07:56] there they go [17:07:57] cool [17:08:00] ok lunchtime :) [17:20:51] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [17:22:01] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:22:20] (03CR) 10Yurik: [C: 04-1] Update Zero netmapper data from zero.wikimedia.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/126829 (owner: 10BBlack) [17:30:01] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 453599 bytes in 9.797 second response time [17:45:55] paravoid, thanks for doing that revert [17:46:08] hey mwalker [17:46:12] I'm about to rerevert [17:46:42] makes sense; so my caching fix was not actually a fix then [17:46:53] ? [17:47:41] your fix was actually perfectly fine [17:47:49] but see https://gerrit.wikimedia.org/r/127001 [17:48:05] (03CR) 10JanZerebecki: [C: 031] Move "RewriteEngine On" earlier in www.wikimedia.org vhost [operations/apache-config] - 10https://gerrit.wikimedia.org/r/91339 (owner: 10Reedy) [17:49:01] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:50:31] oh! 
I'd forgotten about that [17:50:56] (03PS2) 10Ottomata: Adding ensure parameter to varnish::logging [operations/puppet] - 10https://gerrit.wikimedia.org/r/125742 [17:51:01] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 445125 bytes in 9.648 second response time [17:51:05] (03CR) 10Ottomata: [C: 032 V: 032] Adding ensure parameter to varnish::logging [operations/puppet] - 10https://gerrit.wikimedia.org/r/125742 (owner: 10Ottomata) [17:51:52] (03PS3) 10Ottomata: Setting up varnishncsa instance for erbium [operations/puppet] - 10https://gerrit.wikimedia.org/r/125743 [17:53:41] PROBLEM - Host ps1-b5-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [17:53:45] (03CR) 10Ottomata: [C: 032 V: 032] Setting up varnishncsa instance for erbium [operations/puppet] - 10https://gerrit.wikimedia.org/r/125743 (owner: 10Ottomata) [17:54:09] (03PS2) 10Reedy: Wikipedias to 1.23wmf22 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126981 [17:59:06] (03PS1) 10Andrew Bogott: Purge nginx logs after two days. [operations/puppet] - 10https://gerrit.wikimedia.org/r/127019 [17:59:50] (03CR) 10Andrew Bogott: [C: 032] Purge nginx logs after two days. [operations/puppet] - 10https://gerrit.wikimedia.org/r/127019 (owner: 10Andrew Bogott) [18:00:14] (03CR) 10Reedy: [C: 032] Wikipedias to 1.23wmf22 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126981 (owner: 10Reedy) [18:00:50] (03Merged) 10jenkins-bot: Wikipedias to 1.23wmf22 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126981 (owner: 10Reedy) [18:01:15] twkozlowski, yt? [18:02:04] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Wikipedias to 1.23wmf22 [18:02:26] Logged the message, Master [18:06:57] (03CR) 10MaxSem: Create a FeaturedFeed for the Tech News bulletin (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124272 (owner: 10Odder) [18:07:20] ottomata: 1002 looks ready for you [18:08:29] bblack, let me know what you think about my comments. Other than a minor issue, i don't see any problems. yet :) [18:09:13] OOO yup [18:09:21] ok [18:10:01] !log stopping puppet on elastic1001 and elastic1002, reinstalling elastic1002 [18:10:06] Logged the message, Master [18:10:25] (03CR) 10Odder: [C: 031] add wiktionary.eu, link to wiktionary.org [operations/dns] - 10https://gerrit.wikimedia.org/r/126932 (owner: 10Dzahn) [18:11:36] (03PS1) 10Ottomata: elastic1002 now master eligible, elastic1001 no longer [operations/puppet] - 10https://gerrit.wikimedia.org/r/127023 [18:11:37] MaxSem: Yes, I'm here right now. [18:11:55] MaxSem: Is there any way we can enable an Atom feed? :-) [18:12:02] manybubbles: https://gerrit.wikimedia.org/r/#/c/127023/ [18:12:24] twkozlowski, do you already have the messages for https://gerrit.wikimedia.org/r/#/c/124272/ in place on meta? [18:12:36] (03CR) 10Manybubbles: [C: 031] elastic1002 now master eligible, elastic1001 no longer [operations/puppet] - 10https://gerrit.wikimedia.org/r/127023 (owner: 10Ottomata) [18:13:12] MaxSem: As I said in a comment, yes. 
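Returning to the Elasticsearch shard drain agreed on above (the comma-separated exclude._ip value around 17:05–17:07): that setting goes through the cluster settings API, and the relocating/unassigned counts in the health output are what tell you the drain has finished. A hedged sketch, with the endpoint as a placeholder:

```python
# Hedged sketch of the shard-drain step discussed around 17:04-17:07 above: ban two nodes
# by IP through the cluster settings API, then poll cluster health until nothing is
# relocating. The endpoint is a placeholder; the settings key and health fields are
# standard Elasticsearch APIs (the same fields the Icinga recoveries in this log print).
import time
import requests

ES = "http://elastic1003.eqiad.wmnet:9200"   # placeholder: any node in the cluster

requests.put(ES + "/_cluster/settings", json={
    "transient": {
        "cluster.routing.allocation.exclude._ip": "10.64.0.108,10.64.0.109"
    }
})

while True:
    health = requests.get(ES + "/_cluster/health").json()
    print(health["status"],
          "relocating:", health["relocating_shards"],
          "unassigned:", health["unassigned_shards"])
    if health["relocating_shards"] == 0 and health["unassigned_shards"] == 0:
        break
    time.sleep(30)
```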
[18:13:22] (03PS2) 10Ottomata: elastic1002 now master eligible, elastic1001 no longer [operations/puppet] - 10https://gerrit.wikimedia.org/r/127023 [18:13:28] (03CR) 10Ottomata: [C: 032 V: 032] elastic1002 now master eligible, elastic1001 no longer [operations/puppet] - 10https://gerrit.wikimedia.org/r/127023 (owner: 10Ottomata) [18:13:34] Apr 6 11:16 PM [18:13:34] okay, then I'll nominate them for swatting today [18:13:41] \o/ [18:13:58] Cool, thanks [18:14:02] (03PS2) 10Reedy: Rest of group0 to 1.24wmf1 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126982 [18:14:07] (03CR) 10Reedy: [C: 032] Rest of group0 to 1.24wmf1 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126982 (owner: 10Reedy) [18:14:17] (03Merged) 10jenkins-bot: Rest of group0 to 1.24wmf1 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126982 (owner: 10Reedy) [18:15:19] MaxSem: Though https://gerrit.wikimedia.org/r/#/c/124272/6/wmf-config/InitialiseSettings.php [18:15:32] I set it to weekly updates following the frwikisource example [18:15:44] But I'm having second thoughts whether it would work for us? [18:15:49] We only need it to update on Mondays. [18:16:01] PROBLEM - Host elastic1002 is DOWN: PING CRITICAL - Packet loss = 100% [18:16:44] MaxSem: https://meta.wikimedia.org/w/index.php?title=MediaWiki:Ffeed-technews-page&action=edit has the magic from the documentation at MediaWiki.org [18:17:29] eh, this is wrong [18:17:52] should be just Tech/News/{{CURRENTYEAR}}/{{CURRENTWEEK}} [18:18:15] Yeah, this would only work on daily updates, right? [18:18:29] or so I thing, haven't touched it much over the last 2 years:) [18:19:07] the page should evaluuate to feed page name for every given moment of time [18:19:09] MaxSem: I'll have someone fix it then [18:19:34] it should just change the result once in a week [18:20:56] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.24wmf1 [18:21:01] yurik: there are no comments, just a blank -1 :) [18:21:11] RECOVERY - Host elastic1002 is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms [18:21:16] in any case, I have a few puppet-y odds and ends to sort out before that can go out [18:21:19] Logged the message, Master [18:22:37] (03CR) 10Yurik: "sorry, comments are on PS2 accidently, but apply to PS3" (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/126829 (owner: 10BBlack) [18:22:52] Reedy, from 22 straight to 24?:P [18:23:06] oh, 1.24 [18:23:09] wheeeeeeeeeeeeeee [18:23:21] PROBLEM - SSH on elastic1002 is CRITICAL: Connection refused [18:23:21] PROBLEM - RAID on elastic1002 is CRITICAL: Connection refused by host [18:23:31] PROBLEM - check configured eth on elastic1002 is CRITICAL: Connection refused by host [18:23:39] bblack, comments were in the wrong PS2, so they didn't post. 
let me know [18:23:41] PROBLEM - puppet disabled on elastic1002 is CRITICAL: Connection refused by host [18:23:41] PROBLEM - DPKG on elastic1002 is CRITICAL: Connection refused by host [18:23:41] PROBLEM - ElasticSearch health check on elastic1002 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.109 [18:23:51] PROBLEM - check if dhclient is running on elastic1002 is CRITICAL: Connection refused by host [18:24:11] PROBLEM - Disk space on elastic1002 is CRITICAL: Connection refused by host [18:25:35] yurik: ok commented on your comments (as an aside, that's kind of annoying that gerrit doesn't show you outdated unread comments) [18:26:20] bblack, if you are using chrome, install https://github.com/jdlrobson/gerrit-be-nice-to-me [18:27:01] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:27:18] also, bblack you need to hit "review" button - otherwise i don't see your comments [18:27:56] (03PS3) 10Ottomata: Changing erbium's udp2log instance to use unicast on port 8419 [operations/puppet] - 10https://gerrit.wikimedia.org/r/125744 [18:28:06] (03CR) 10BBlack: [C: 04-1] Update Zero netmapper data from zero.wikimedia.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/126829 (owner: 10BBlack) [18:29:20] (03CR) 10Ottomata: [C: 032 V: 032] "Great, unicast logs are blasting at erbium on 8419 now. Time to switch to unicast!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/125744 (owner: 10Ottomata) [18:30:12] bblack, nope, need to review the patchset that you left the comments on - i suspect you did them on PS2 [18:30:20] welcome to gerrit [18:30:21] (03CR) 10JanZerebecki: [C: 031] "Yes please!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/126205 (owner: 10Dzahn) [18:30:30] !log switching erbium udp2log instance from consuming multicast relay to unicast direct from varnishes [18:30:36] (03PS4) 10Reedy: Second batch of pilot sites for Media Viewer [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/125032 (owner: 10MarkTraceur) [18:30:38] yeah, gerrit sucks at these things [18:30:42] (03CR) 10Reedy: [C: 032] Second batch of pilot sites for Media Viewer [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/125032 (owner: 10MarkTraceur) [18:30:55] (03Merged) 10jenkins-bot: Second batch of pilot sites for Media Viewer [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/125032 (owner: 10MarkTraceur) [18:30:56] Logged the message, Master [18:30:57] (03CR) 10BBlack: Update Zero netmapper data from zero.wikimedia.org (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/126829 (owner: 10BBlack) [18:31:10] is it intentional that the lock next to https external links looks different now? [18:31:27] anyways, I'm reading further in python docs about both of your comments now, but bottom line: it works, and leaks don't matter when the script exits immediately anyways :) [18:32:14] but the python fd thing is interesting [18:32:33] the whole "with" hack to explicitly scope them because it's up to GC-randomness otherwise is... 
odd to me [18:32:56] !log reedy synchronized database lists files: I0c36c65bb9f405e03b84d3f6c6b93acda522c5c9 [18:33:15] Logged the message, Master [18:34:01] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 378098 bytes in 9.222 second response time [18:34:29] bblack, its the same in C# & Java - you have objects whose lifetime and proper disposing is more important than regular non-native, memory-only objects - hence you have a special IDisposable (C#) interface, and a language construct to do try: ... finaly: dispose [18:34:32] !log reedy synchronized wmf-config/InitialiseSettings.php 'Touch for I0c36c65bb9f405e03b84d3f6c6b93acda522c5c9' [18:34:40] Logged the message, Master [18:35:04] (03CR) 10Reedy: [C: 032] Adding '*.panoramio.com' to the wgCopyUploadsDomains array [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126384 (owner: 10Marco) [18:35:07] (03PS4) 10Reedy: Adding '*.panoramio.com' to the wgCopyUploadsDomains array [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126384 (owner: 10Marco) [18:35:20] (03PS2) 10Reedy: Remove $wmgUseMicroDesign [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126987 (owner: 10Bartosz Dziewoński) [18:35:21] PROBLEM - NTP on elastic1002 is CRITICAL: NTP CRITICAL: No response from NTP server [18:36:32] bblack, i am not sure what you mean by desired semantics? same_file_contents just compares the file, doesn't it? [18:36:59] (03PS4) 10BBlack: Update Zero netmapper data from zero.wikimedia.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/126829 [18:37:21] RECOVERY - SSH on elastic1002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.3 (protocol 2.0) [18:38:52] yurik: my impression from reading the filecmp source was that it cares more about stat fields than we'd like, but I'll go re-read it [18:39:44] but otherwise, yes, compare_file_contents is mostly do_cmp from filecmp [18:42:08] Jeff_Green: yt? [18:42:15] wondering if this file is needed anymore [18:42:15] files/searchqa/bin/refresh_api_log [18:42:19] i don't see it installed by puppet anywhere [18:42:55] hmm. that's a good question. [18:42:57] (03CR) 10Reedy: [C: 032] Adding '*.panoramio.com' to the wgCopyUploadsDomains array [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126384 (owner: 10Marco) [18:43:03] (03Merged) 10jenkins-bot: Adding '*.panoramio.com' to the wgCopyUploadsDomains array [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126384 (owner: 10Marco) [18:43:58] bblack, later, i wanted to discuss with you your old proposal - to make the whole zero.vcl data driven - IPs will map not to the XCS id, but to a string that has XCS + all relevant requirements. [18:43:59] yurik: you're right! I just got confused with the array offsets in their _sig(stat()) thing. 
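On the Python tangent above: the `with` block bblack finds odd is simply deterministic cleanup — the files are closed when the block exits rather than whenever the garbage collector gets to them — and the stdlib's filecmp.cmp() in its default shallow mode compares a stat signature (file type, size, mtime) rather than bytes, which is why a content-only comparison was worth writing. A sketch of both points (a reconstruction, not the actual netmapper script):

```python
# Sketch of the two points above (a reconstruction, not the actual script): "with" closes
# both files deterministically when the block exits, instead of whenever the garbage
# collector gets around to it; and comparing bytes directly avoids filecmp.cmp()'s
# default shallow mode, which only compares a stat signature (file type, size, mtime).
import filecmp

def same_file_contents(path_a, path_b, bufsize=8192):
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            chunk_a = a.read(bufsize)
            chunk_b = b.read(bufsize)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:          # hit EOF on both files at the same point
                return True

# filecmp.cmp(a, b) is shallow by default; filecmp.cmp(a, b, shallow=False) does read the
# files, which is essentially what the helper above spells out.
```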
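And on yurik's idea of making zero.vcl fully data-driven, which bblack sketches just below with a delimited carrier string: a purely hypothetical parse of such a string into the flags the current if-logic hard-codes (field order, markers and names here are invented for illustration, not anything deployed):

```python
# Purely hypothetical: one way a delimited carrier string like the "470-03|m|o|s|en,ar,zh"
# example floated just below could be unpacked into the flags that zero.vcl's if-logic
# currently hard-codes. Field order, markers and names are invented for illustration.
def parse_carrier(tag):
    xcs, subdomains, opera, https, langs = tag.split("|")
    return {
        "xcs": xcs,                          # carrier ID, e.g. "470-03"
        "m_only": subdomains == "m",         # m.* only, vs. zero.* as well
        "opera_allowed": opera == "o",
        "https_allowed": https == "s",
        "languages": langs.split(","),       # whitelisted language subdomains
    }

print(parse_carrier("470-03|m|o|s|en,ar,zh"))
```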
[18:44:11] cool :) [18:44:28] * yurik hates to write new code :D [18:44:45] (03PS5) 10BBlack: Update Zero netmapper data from zero.wikimedia.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/126829 [18:44:55] ottomata: clearly that wouldn't work anymore [18:45:20] yurik: yeah, so way way back before we talked about expanding the JSON to have structured data, I figured we could just pick a decent delimiter character and stuff N fields in the string we're using for the carrier ID currently [18:45:45] ottomata: afaik all it was for was to grab a list of history from API calls, which were then used as test input [18:46:05] bblack, yes, but we are still trying to figure out all the variants that we could have - look at all the complex IFs in zero.vcl [18:46:11] so, Jeff_Green I can just remove the file from puppet repo? [18:46:21] ottomata: yeah I guess so [18:46:39] yurik: {"470-03|m|o|s|en,ar,zh" => [ ... ], ... [18:46:47] k danke, it referenced emery and I am prepping emery for decom [18:47:00] it seems like most of the if-logic could be covered with a few flags for m-or-zero, opera-or-not, ssl-allowed, and a list of languages [18:47:01] if anyone is still using that for search QA they'll need a new way to grab logs to replay :-P [18:48:06] so, manybubbles, i'm running puppet on 1002 now [18:48:14] bblack, we might have different subsets of IPs with different rules - a carrier might have a WAP proxy that does not support IP whitelisting, and even though the XCS is the same, other rules might not be. Some backend changes required [18:48:16] as soon as all is well and we are ready to move shards to it [18:48:19] we should shut down 1001, right? [18:48:44] bblack, but yes, something along those lines [18:48:50] yurik: clearly the best answer is vmod_lua, and having your json contain raw lua code to process requests for each carrier's set of ranges :) [18:49:20] bblack, ESI ;) [18:49:32] solves all this mess once and for all [18:50:20] well, with varnish4 out now, we're finally heading down the road to finding out whether ESI will be usable or not [18:50:28] (03PS3) 10Reedy: Remove $wmgUseMicroDesign [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126987 (owner: 10Bartosz Dziewoński) [18:50:33] (03CR) 10Reedy: [C: 032] Remove $wmgUseMicroDesign [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126987 (owner: 10Bartosz Dziewoński) [18:50:41] RECOVERY - ElasticSearch health check on elastic1002 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5627: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [18:50:51] (03Merged) 10jenkins-bot: Remove $wmgUseMicroDesign [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126987 (owner: 10Bartosz Dziewoński) [18:51:25] (03PS2) 10Reedy: Remove $wmgUsabilityEnforce [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126989 (owner: 10Bartosz Dziewoński) [18:51:28] (03PS1) 10Ottomata: Removing references to emery in prep for emery's decommission [operations/puppet] - 10https://gerrit.wikimedia.org/r/127032 [18:51:34] (03CR) 10Reedy: [C: 032] Remove $wmgUsabilityEnforce [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126989 (owner: 10Bartosz Dziewoński) [18:52:02] bblack, i'll believe it when i see it )) [18:52:16] (03Merged) 10jenkins-bot: Remove $wmgUsabilityEnforce [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126989 (owner: 10Bartosz Dziewoński) [18:53:35] (03CR) 10Yurik: [C: 031] "haven't tested, but python code looks good" [operations/puppet] - 10https://gerrit.wikimedia.org/r/126829 (owner: 10BBlack) [18:54:01] (03PS2) 10Reedy: Create autopatrolled user group on brwikimedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126251 (owner: 10Odder) [18:54:09] (03CR) 10Reedy: [C: 032] Create autopatrolled user group on brwikimedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126251 (owner: 10Odder) [18:54:11] RECOVERY - NTP on elastic1002 is OK: NTP OK: Offset -0.09839582443 secs [18:54:16] (03Merged) 10jenkins-bot: Create autopatrolled user group on brwikimedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126251 (owner: 10Odder) [18:54:51] (03PS3) 10Reedy: Remove useless "confirmed" permission assignments [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/116059 (owner: 10TTO) [18:54:55] (03CR) 10Reedy: [C: 032] Remove useless "confirmed" permission assignments [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/116059 (owner: 10TTO) [18:55:05] (03Merged) 10jenkins-bot: Remove useless "confirmed" permission assignments [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/116059 (owner: 10TTO) [18:55:32] (03PS1) 10Faidon Liambotis: Reenable CentralNotice CrossWiki Hiding [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127040 [18:55:34] mwalker: ^ [18:55:41] RECOVERY - puppet disabled on elastic1002 is OK: OK [18:55:41] RECOVERY - DPKG on elastic1002 is OK: All packages OK [18:55:51] RECOVERY - check if dhclient is running on elastic1002 is OK: PROCS OK: 0 processes with command name dhclient [18:56:11] RECOVERY - Disk space on elastic1002 is OK: DISK OK [18:56:13] !log reedy synchronized wmf-config/ [18:56:21] RECOVERY - RAID on elastic1002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [18:56:31] RECOVERY - check configured eth on elastic1002 is OK: NRPE: Unable to read output [18:56:39] paravoid, thanks :) [18:57:06] (03CR) 10Faidon Liambotis: [C: 032] Reenable CentralNotice CrossWiki Hiding [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127040 (owner: 10Faidon Liambotis) [18:57:14] (03Merged) 10jenkins-bot: Reenable CentralNotice CrossWiki Hiding [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127040 (owner: 10Faidon Liambotis) [18:57:56] oh I'm in the middle of a deployment window [18:58:01] sorry, didn't realize until now [18:58:15] 
Reedy: okay for me to sync-file CommonSettings.php for ^^ ^ [18:58:19] Logged the message, Master [18:58:23] Yup [18:58:27] Fine by me [18:58:36] thanks [18:58:51] !log faidon updated /a/common to {{Gerrit|Ie95165065}}: Reenable CentralNotice CrossWiki Hiding [18:58:58] Logged the message, Master [18:59:09] !log faidon synchronized wmf-config/CommonSettings.php 'reenable CN CrossWiki Hiding' [18:59:20] Logged the message, Master [19:01:45] manybubbles: 1002 looks good [19:01:48] moving shards there [19:03:41] PROBLEM - Puppet freshness on db1056 is CRITICAL: Last successful Puppet run was Wed 16 Apr 2014 06:54:47 AM UTC [19:04:43] !log reinstalling elastic1001 [19:04:51] Logged the message, Master [19:04:53] (03PS4) 10Reedy: All wikis with <250k pages opted in [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126671 (owner: 10Chad) [19:05:12] manybubbles: yt? [19:05:22] hope that's ok! here it goes [19:05:31] (03PS5) 10Reedy: All wikis with <250k pages opted in to Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126671 (owner: 10Chad) [19:05:37] (03CR) 10Reedy: [C: 032] All wikis with <250k pages opted in to Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126671 (owner: 10Chad) [19:05:56] (03Merged) 10jenkins-bot: All wikis with <250k pages opted in to Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126671 (owner: 10Chad) [19:06:15] (03PS3) 10Reedy: Raise the Elasticsearch refresh interval [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126811 (owner: 10Manybubbles) [19:06:17] iiinteresting, ok, we are throwing more search traffic at elastic nodes while we are busy reinstalling some of them, eh? [19:06:18] cool! [19:06:19] haha [19:06:19] (03CR) 10Reedy: [C: 032] Raise the Elasticsearch refresh interval [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126811 (owner: 10Manybubbles) [19:06:30] (03Merged) 10jenkins-bot: Raise the Elasticsearch refresh interval [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126811 (owner: 10Manybubbles) [19:07:31] PROBLEM - Host elastic1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:08:11] (03CR) 10Ottomata: [C: 032 V: 032] Removing references to emery in prep for emery's decommission [operations/puppet] - 10https://gerrit.wikimedia.org/r/127032 (owner: 10Ottomata) [19:09:01] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:09:38] !log reedy synchronized database lists files: I6fc44d3eb829d656d352dab652148dd327b06679 [19:09:45] Logged the message, Master [19:10:01] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 490301 bytes in 9.673 second response time [19:10:19] !log reedy synchronized wmf-config/ [19:12:36] (03PS1) 10Faidon Liambotis: icinga: allow private1-esams through firewall [operations/puppet] - 10https://gerrit.wikimedia.org/r/127125 [19:12:41] RECOVERY - Host elastic1001 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [19:12:43] bblack: ^ [19:13:04] (03CR) 10Faidon Liambotis: [C: 032] icinga: allow private1-esams through firewall [operations/puppet] - 10https://gerrit.wikimedia.org/r/127125 (owner: 10Faidon Liambotis) [19:13:52] paravoid: hah, that makes sense [19:14:41] PROBLEM - RAID on elastic1001 is CRITICAL: Connection refused by host [19:14:41] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.108 [19:14:47] RobH: I'm reading Server Lifecycle doc for 
decomissioning emery [19:14:51] PROBLEM - check configured eth on elastic1001 is CRITICAL: Connection refused by host [19:14:51] PROBLEM - puppet disabled on elastic1001 is CRITICAL: Connection refused by host [19:15:11] PROBLEM - SSH on elastic1001 is CRITICAL: Connection refused [19:15:16] just double checking, since emery is in tampa and is old [19:15:21] PROBLEM - DPKG on elastic1001 is CRITICAL: Connection refused by host [19:15:21] PROBLEM - Disk space on elastic1001 is CRITICAL: Connection refused by host [19:15:21] PROBLEM - check if dhclient is running on elastic1001 is CRITICAL: Connection refused by host [19:15:22] we are decoming, not reclaiming, right? [19:15:32] (03PS1) 10Jalexander: Add meta to legalteamwiki import sources [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127126 [19:15:40] ottomata: lets see [19:16:03] oo Special section, reading :) [19:16:39] ottomata: I'm back [19:16:41] sorry [19:16:52] s'ok, elastif1002 is up and should be master eligible [19:16:54] ottomata: yep, decom [19:16:55] 1001 is installing [19:17:03] can you check that things are ok? [19:17:05] it was purchased in 2010, so its super old [19:17:11] ok cool, thanks RobH, yeah [19:17:23] yeah it hikn I can follow most of these steps and then update the rt ticket with what i've done [19:17:31] and leave it to someone else to do racktables and port swtiches, etc. [19:17:35] (i guess?) [19:17:38] the lifecycle is perhaps the single most polished document on wikitech at the moment [19:17:45] so im sure it'll be good [19:18:01] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:18:12] yea, on the 'decom specific' it stops for you at dropping a ticket for that stuff in the local queue [19:18:36] then chris or i will take care of wiping and unracking next week [19:18:47] k awesome [19:18:49] thanks [19:20:01] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 489100 bytes in 9.981 second response time [19:20:04] (03PS1) 10Ottomata: Decommissioning emery [operations/puppet] - 10https://gerrit.wikimedia.org/r/127127 [19:21:17] ottomata: so everything looks good. ^d is making some new indexes during this but that shouldn't impact us [19:21:22] k [19:21:23] cool [19:21:26] yeah i saw those things being merged [19:21:30] and was like...OoooOK! [19:21:42] won't hurt us [19:22:07] though, it looks like we're getting periodicly hit with slowness. have a look at terbium:/a/mw-log/CirrusSearch-slow.log [19:22:27] it looks like we get hit with a bunch of slow query from time to time. like something we do causes it? [19:22:29] I dunno [19:22:44] <^d> hmmmm [19:22:49] terbium? i thought fluorine? [19:23:01] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:23:41] fluorine, yeah [19:23:43] sorry [19:23:48] funky graph: http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=es_query_time&s=by+name&c=Elasticsearch+cluster+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [19:24:03] looks like _something_ caused a bunch of queries to stall [19:24:06] for a few seconds [19:24:41] <^d> i'm in urwiki, creatin ur indexes [19:24:49] <^d> i love language codes [19:26:02] <^d> Ok, indexes all done, you guys shouldn't notice me anymore. [19:26:15] ^d: cool - just going to populate them? 
[19:26:27] that shouldn't really cause much/any additional load either [19:27:12] PROBLEM - NTP on elastic1001 is CRITICAL: NTP CRITICAL: No response from NTP server [19:27:31] <^d> manybubbles: Yeah, starting pass 1 now. Just leaving in a screen and going back to other things. [19:27:32] (03PS1) 10RobH: replacing blog.wikimedia.org.pem [operations/puppet] - 10https://gerrit.wikimedia.org/r/127130 [19:28:01] ^d: sweet [19:28:10] ottomata: I was looking at that funky blip [19:28:11] RECOVERY - Puppet freshness on lvs3002 is OK: puppet ran at Thu Apr 17 19:28:05 UTC 2014 [19:28:11] RECOVERY - SSH on elastic1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.3 (protocol 2.0) [19:28:32] it looks like there was a network blip or something [19:28:39] (03CR) 10RobH: [C: 032 V: 032] replacing blog.wikimedia.org.pem [operations/puppet] - 10https://gerrit.wikimedia.org/r/127130 (owner: 10RobH) [19:28:39] or a ganglia aggregator [19:28:55] because the number of cpus dropped off [19:29:20] !log blog.w.o certificate swap (yes, again ;), apache may hiccup [19:29:25] (03PS2) 10Ottomata: Decommissioning emery [operations/puppet] - 10https://gerrit.wikimedia.org/r/127127 [19:29:30] (03CR) 10Ottomata: [C: 032 V: 032] Decommissioning emery [operations/puppet] - 10https://gerrit.wikimedia.org/r/127127 (owner: 10Ottomata) [19:29:45] morebots: snap to it damn you [19:29:49] Logged the message, RobH [19:29:49] I am a logbot running on tools-exec-03. [19:29:49] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [19:29:49] To log a message, type !log . [19:30:31] !log disabling puppet on emery for decommission [19:30:38] Logged the message, Master [19:31:22] (03PS1) 10Ori.livneh: Set domain to TLD on GeoIP cookie [operations/puppet] - 10https://gerrit.wikimedia.org/r/127131 [19:33:05] !log blog.w.o cert replacement successful [19:33:14] ^ bblack that one's for you :P [19:33:15] Logged the message, RobH [19:36:54] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 488242 bytes in 9.873 second response time [19:38:19] (03PS1) 10RobH: ticket.wikimedia.org cert replacement [operations/puppet] - 10https://gerrit.wikimedia.org/r/127133 [19:38:40] (03CR) 10RobH: [C: 032 V: 032] ticket.wikimedia.org cert replacement [operations/puppet] - 10https://gerrit.wikimedia.org/r/127133 (owner: 10RobH) [19:38:44] RECOVERY - ElasticSearch health check on elastic1001 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1948: active_shards: 5783: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:39:01] (03PS1) 10Jkrauska: Add jkrauska [operations/puppet] - 10https://gerrit.wikimedia.org/r/127134 [19:39:54] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:40:03] !log replacing ticket.wikimedia.org cert/key, apache may hiccup [19:40:14] Logged the message, RobH [19:41:24] RECOVERY - Puppet freshness on lvs3003 is OK: puppet ran at Thu Apr 17 19:41:16 UTC 2014 [19:42:55] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 486921 bytes in 9.629 second response time [19:43:27] ottomata: elastic1001 is back online and not master eligible which is great. 
no plugins yet [19:43:44] RECOVERY - RAID on elastic1001 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [19:43:52] (03PS1) 10RobH: replacing star.wikimedia.org.pem, misc-web-lb.eqiad cert [operations/puppet] - 10https://gerrit.wikimedia.org/r/127135 [19:43:54] RECOVERY - check configured eth on elastic1001 is OK: NRPE: Unable to read output [19:43:54] RECOVERY - puppet disabled on elastic1001 is OK: OK [19:43:58] its still doing stuff [19:44:05] i have to run puppet twice [19:44:11] (03PS2) 10Jkrauska: Add jkrauska [operations/puppet] - 10https://gerrit.wikimedia.org/r/127134 [19:44:14] RECOVERY - DPKG on elastic1001 is OK: All packages OK [19:44:15] RECOVERY - Disk space on elastic1001 is OK: DISK OK [19:44:24] RECOVERY - check if dhclient is running on elastic1001 is OK: PROCS OK: 0 processes with command name dhclient [19:44:24] RECOVERY - Puppet freshness on lvs3001 is OK: puppet ran at Thu Apr 17 19:44:23 UTC 2014 [19:45:15] RECOVERY - Puppet freshness on lvs3004 is OK: puppet ran at Thu Apr 17 19:45:09 UTC 2014 [19:45:16] (03CR) 10RobH: [C: 032 V: 032] replacing star.wikimedia.org.pem, misc-web-lb.eqiad cert [operations/puppet] - 10https://gerrit.wikimedia.org/r/127135 (owner: 10RobH) [19:45:22] (03CR) 10jenkins-bot: [V: 04-1] Add jkrauska [operations/puppet] - 10https://gerrit.wikimedia.org/r/127134 (owner: 10Jkrauska) [19:46:04] !log power off emery [19:46:10] Logged the message, Master [19:47:06] (03PS3) 10Jkrauska: Add jkrauska [operations/puppet] - 10https://gerrit.wikimedia.org/r/127134 [19:47:24] RobH, emery is powered off and ready for the crusher or wherever it goes, what queue should I put this (or create a new) ticket in? [19:47:24] https://rt.wikimedia.org/Ticket/Display.html?id=6143&results=ddd5c126e004948b21fa6c382a0925e6 [19:47:29] https://rt.wikimedia.org/Ticket/Display.html?id=6143 [19:48:04] ottomata: so page says the appropriate datacenter specific queue [19:48:07] so if its in tampa, pmtpa queue [19:48:10] ok [19:48:30] manybubbles: 1001 is good to go, if you say ok i will begin moving shards to it [19:48:44] so just a ticket saying to decom emery, placed in pmtpa queue [19:48:44] PROBLEM - LVS HTTP IPv4 on misc-web-lb.eqiad.wikimedia.org is CRITICAL: Connection refused [19:48:48] PROBLEM - LVS HTTPS IPv6 on misc-web-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection refused [19:48:52] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [19:48:53] arghhh [19:48:54] PROBLEM - LVS HTTP IPv6 on misc-web-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection refused [19:48:55] my fault [19:49:07] ottomata: go ahead! [19:49:10] fixing misc [19:49:30] cool, great! that was easy manybubbles [19:49:34] PROBLEM - LVS HTTPS IPv4 on misc-web-lb.eqiad.wikimedia.org is CRITICAL: Connection refused [19:49:37] let's do the next ones tomorrow, i'm going to leave in about an hour [19:49:41] ottomata: I'm glad! [19:49:42] puppet happened to run automatically on one when i depooled the other [19:49:45] sounds good [19:49:50] lesson learned, halt puppet on both, then do this shit [19:50:24] pages? [19:50:34] yep! 
[19:50:46] it shoudl clear up shortly [19:50:50] <_joe_> apergos: RobH was missing us [19:50:55] heh [19:50:57] <_joe_> so he called us back here [19:51:00] you guys shouldnt be paged unless yer 24/7 [19:51:04] and then thats yer own damned fault ;] [19:51:12] it's before 11 pm [19:51:18] <_joe_> RobH: it' cet awake hours :) [19:51:23] oh, then, uhh [19:51:25] sorry =] [19:51:30] s'ok [19:51:40] we need EEST waking hours !!! [19:51:42] nice to come on and see it's not a crisis [19:51:47] we would have been spared ! [19:51:49] oh, nice, cp1044 is actually just borked [19:52:07] damn puppet for doing what its supposed to do when i wasnt ready for it to do it [19:53:10] <_joe_> RobH: I will use this quote. [19:54:19] hrmm [19:54:26] why is it not alerting its back... whats up pybal. [19:54:26] <_joe_> RobH: need assistance? [19:54:38] well, nginx is restored to service on both of them [19:54:50] checking out what pybal says [19:55:33] ottomata: elastic1001 isn't gangliaing [19:56:30] _joe_: so it says they are down in pybal [19:56:36] but they seem totally fine to me =/ [19:56:44] <_joe_> RobH: which lvs? [19:56:48] lvs1002 [19:56:50] <_joe_> I can try to check it [19:56:52] misc-web-lb.eqiad [19:57:14] RECOVERY - NTP on elastic1001 is OK: NTP OK: Offset -0.01193249226 secs [19:57:24] !log both cp1043 and cp1044 seem online and serving nginx service, but pybal says they are down still working [19:57:30] Logged the message, RobH [19:57:31] !log still working on issue [19:57:38] Logged the message, RobH [19:58:47] meh, odd cert dates in directories [19:58:48] ah manybubbles were you saying I needed to reboot gmond on those nodes? [19:58:54] im shredding them all on cp1043 and rerunning puppet [19:59:00] ottomata: I just went and did it [19:59:04] if that shows up in pybal afterwards will do same on cp1044 [19:59:06] I think it needs to happen as you start the nodes [19:59:12] like, maybe a bit after. not sure [19:59:15] this is never smooth. [19:59:23] <_joe_> RobH: openssl issue I'd say [19:59:49] how so, that its false on pybal's part? [20:00:20] <_joe_> RobH: mh I was looking at the logs of pybal [20:01:03] <_joe_> RobH: nevermind, sorry. go on with your actions [20:01:20] well, im not sure my actions are going to fix it, but its rolling to both now [20:02:41] paravoid: you about? [20:02:47] yes [20:02:49] what's up? [20:02:50] Im now getting into the field of 'wtf did i do' [20:03:02] so tried to replace cert on cp1043/44 and it seems fine on the local systems [20:03:06] (03CR) 10Manybubbles: [C: 031] Only load/enable Lucene on production (not on labs) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126804 (owner: 10Reedy) [20:03:09] but pybal is refusing to pool them stating they are down [20:03:18] and im not sure if its something ive done wrong or pybal being odd. [20:03:26] the certs appear fine on each system [20:03:50] ottomata: elastic1001 really doesn't seem to be doing ganglia [20:03:55] like it reports that it is down [20:04:15] the chain was created correctly as well [20:04:18] <_joe_> RobH: openssl connects to both servers [20:04:21] the chain isn't correct [20:04:26] it isnt? [20:04:30] but that wouldn't affect pybal [20:05:00] oh its listing one of them twice [20:05:01] thats odd. [20:05:42] no, its not even that, i have no idea wtf it did. [20:06:15] paravoid: but this happened cuz i didnt halt puppet on both before merging [20:06:22] and it called in at the wrong time, so bad move on my part. 
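What _joe_ and RobH are checking by hand here — whether the chain the frontend actually serves verifies cleanly — can be probed with a few lines; a sketch assuming a host that terminates HTTPS and the local system CA bundle:

```python
# Small probe in the spirit of the manual openssl checks above: connect to the TLS
# frontend and let Python's default context verify the chain it actually serves. A
# missing or wrong intermediate shows up here as a verification error even when the
# certificate files look fine on disk. The hostname below is just an example.
import socket
import ssl

def check_chain(host, port=443):
    ctx = ssl.create_default_context()      # verifies against the system CA bundle
    try:
        with socket.create_connection((host, port), timeout=5) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                subject = dict(pair[0] for pair in tls.getpeercert()["subject"])
                print(host, "chain OK:", subject.get("commonName"))
    except ssl.SSLError as err:
        print(host, "chain problem:", err)

# check_chain("misc-web-lb.eqiad.wikimedia.org")
```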
[20:06:37] <_joe_> however, I do see that on this lvs a lot of servers are down according to pybal [20:06:54] RECOVERY - LVS HTTP IPv6 on misc-web-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.001 second response time [20:06:57] fixed [20:07:06] the new resolv.conf config is buggy [20:07:10] <_joe_> paravoid: how? :) [20:07:33] we have nameserver 208.80.154.239 listed as the first nameserver on lvs1005 [20:07:36] that's dns-rec-lb [20:07:42] which is behind lvs [20:07:45] RECOVERY - LVS HTTPS IPv6 on misc-web-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 226 bytes in 0.023 second response time [20:07:48] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [20:07:48] so lvs1005 has this IP on loopback as well [20:08:07] but lvs1005 doesn't run a recursor, so DNS on that IP doesn't work [20:08:13] and pybal apparently isn't very happy about that [20:08:44] so back to the fact my chain is still fubar but thats ok i can hack at it [20:08:51] paravoid: thank you for fixing! [20:08:56] np [20:09:20] bblack: ^^ [20:09:23] the thing I was just telling you [20:09:28] apparently it's causing problems [20:09:32] * paravoid files an RT [20:09:57] ok, RT, then outage report for the text-lb outage, then meeting [20:12:21] (03CR) 10Andrew Bogott: "Ryan says that this was originally coded by Domas and licensed public domain. So, azatoth, I think you can go ahead and relicense everyth" [operations/debs/adminbot] - 10https://gerrit.wikimedia.org/r/68935 (owner: 10AzaToth) [20:12:38] ok [20:12:41] im still not sure wtf is wrong [20:12:46] and misc-web-lb is still down [20:12:56] no it's not? [20:12:57] i even manually fixed the chain to get it online for tweakin [20:13:04] ottomata: Apr 17 20:12:19 elastic1001 /usr/sbin/gmond[29393]: Check Operating System (kernel) limits, change or disable buffer size. Exiting.#012 [20:13:07] pybal says its not pooled [20:13:10] stuff isn't starting [20:13:35] RobH: where? [20:13:43] 2014-04-17 20:12:06.866352 [misc_web_80 IdleConnection] cp1043.eqiad.wmnet (enabled/up/pooled): Connection established. [20:13:46] 2014-04-17 20:12:07.212238 [misc_web6_80 IdleConnection] cp1044.eqiad.wmnet (enabled/up/pooled): Connection established. [20:13:49] works fine here [20:14:02] also works fine for icinga too, since we got the recoveries [20:14:03] on lvs1002? [20:14:09] why the heck am i not seeing that... [20:14:22] my last entries are all older as well [20:14:40] 1005 [20:14:42] (03PS1) 10Hashar: Pass puppet-lint on realm.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/127138 [20:14:57] ah wait [20:15:04] dammit [20:15:11] 1005 is primary for ipv6 but 1002 for ipv4 [20:15:12] fixing [20:15:19] (03CR) 10Hashar: "Potentially impacts all the production servers and labs instances!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/127138 (owner: 10Hashar) [20:15:31] so i was on wrong system [20:15:31] (and I have IPv6, and you don't ;)) [20:15:34] RECOVERY - LVS HTTPS IPv4 on misc-web-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 227 bytes in 0.009 second response time [20:15:35] ? [20:15:36] no, you weren't [20:15:37] <_joe_> paravoid: to add a bit of complexity [20:15:40] I just have IPv6 at home [20:15:45] RECOVERY - LVS HTTP IPv4 on misc-web-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.007 second response time [20:15:50] oh, you fixed one of the two lvs resolf issues? 
[20:15:53] yes [20:16:03] ok, i understand, mostly [20:16:21] we have two LVS servers per "class", right? [20:16:25] for redundancy [20:16:36] so misc-web-lb is lvs1002/lvs1005 [20:17:02] one is active, the other one is backup (so if the box dies, cr1/2-eqiad will notice that because of BGP and use the other one) [20:17:05] <_joe_> one was primary for ipv6 and the other one for ipv4 [20:17:05] okay so far? [20:17:12] exactly [20:17:17] ok, i get that part [20:17:32] but on both of them, it still shows both those sysetms down in pybal log [20:17:52] <_joe_> RobH: for which pool? [20:18:01] misc_web [20:18:13] i dont know why we are getting clear pages when it shows tha both cp servers are not pooled [20:18:22] well, nm one is pooled [20:18:25] but its down and pooled [20:18:31] pybal wont depool cuz its not allowed [20:18:37] but neither are passing their checks, hence this issue [20:18:40] and puppet reverted my resolv.conf change [20:18:45] ugh [20:18:45] !log disabling puppet on lvs1002/lvs1005 [20:18:52] Logged the message, Master [20:18:54] PROBLEM - Swift HTTP backend on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:19:06] DAMN YOU PUPPET ;] [20:19:08] <_joe_> paravoid: exactly [20:19:21] <_joe_> paravoid: im on 1002, what ip for the NS? [20:19:35] paravoid: and magically they all fall back into serivce [20:19:38] thank you =] [20:19:44] RECOVERY - Swift HTTP backend on ms-fe1002 is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 0.034 second response time [20:19:45] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [20:19:47] !log lvs1002/1005: commenting first resolv.conf entry until we have a more permanent fix, restarting pybal [20:19:53] Logged the message, Master [20:19:54] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:19:56] i didnt trust the clear pages so i wanted to see for myself, and then it was already reverted by puppet i suppose [20:20:06] <_joe_> RobH: now we need to do something about puppet :) [20:20:12] oh crap, swift is in trouble [20:20:14] old yeller it in the corn crib! [20:20:14] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:30] this day is never going to end [20:20:41] !log sorry for the misc-web-lb issues folks, they should be resolved at this time (for now) [20:20:45] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.065 second response time [20:20:47] Logged the message, RobH [20:21:04] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.067 second response time [20:21:12] okay, it's back [20:21:20] ugh [20:25:47] bblack: #7308 [20:27:53] yeah, so chromium/hydrogen do listen on their "real" IPs as well (for recursive DNS) [20:27:57] upload wizard refuses to store files temporary in buffer [20:28:24] it's just a matter of figuring out how to tell puppet "only on the lvs servers that do the DNS balancing, chromium+hydrogen directly for resolv.conf" [20:28:35] !log restarting elastic1016 - it is freaking out. If it happens again I'll dig deeper, but for now I consider it a fluke of the rolling restarts today.... 
[20:28:41] Logged the message, Master [20:28:46] (03CR) 10BryanDavis: [WIP] Configure scap master and clients in beta (034 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/123674 (owner: 10BryanDavis) [20:28:55] (03PS15) 10BryanDavis: [WIP] Configure scap master and clients in beta [operations/puppet] - 10https://gerrit.wikimedia.org/r/123674 [20:34:54] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:36:54] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 483883 bytes in 9.729 second response time [20:37:08] (03PS1) 10Hashar: retab role/nova.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/127146 [20:37:10] (03PS1) 10Hashar: puppet-lint role/nova.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/127147 [20:37:39] mwalker: hey, I don't see HideBanners for FR banners right now [20:37:45] mwalker: are you still running those? [20:37:54] <_joe_> !log restarting gitblit in order to prevent crippling due to the usual memory leak [20:37:58] Logged the message, Master [20:38:03] paravoid, we are; but at a much reduced volume [20:38:21] or maybe only people in the US click close buttons... *shrugs* [20:38:25] (03PS2) 10Hashar: puppet-lint role/nova.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/127147 [20:38:27] heh [20:38:58] (03CR) 10Hashar: "Forgot to change $realm to $::realm :D" [operations/puppet] - 10https://gerrit.wikimedia.org/r/127147 (owner: 10Hashar) [20:39:07] paravoid, do you want me to put something up to test? [20:39:26] what do you mean? [20:39:49] 1402 RxRequest c GET [20:39:49] 1402 RxURL c /wiki/undefined [20:39:52] lol [20:39:54] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:40:02] paravoid, I can put up a US banner [20:40:18] (03PS1) 10Dr0ptp4kt: Add HTTPS support for 514-02. [operations/puppet] - 10https://gerrit.wikimedia.org/r/127148 [20:41:23] !log elastic1016 restarted and not freaking out any more. [20:41:29] Logged the message, Master [20:42:07] bblack, when you have a minute, would you please review and, if appropriate, +2 merge and deploy https://gerrit.wikimedia.org/r/#/c/127148/ ? [20:42:54] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 484848 bytes in 9.911 second response time [20:45:46] dr0ptp4kt: this is a new carrier, which is starting with HTTPS already? [20:45:55] bblack: yes [20:45:59] ok [20:46:31] (03CR) 10BBlack: [C: 032 V: 032] Add HTTPS support for 514-02. [operations/puppet] - 10https://gerrit.wikimedia.org/r/127148 (owner: 10Dr0ptp4kt) [20:46:39] bblack: gracias [20:47:06] (03PS1) 10Ori.livneh: Follow-up to I02673456f [operations/puppet] - 10https://gerrit.wikimedia.org/r/127149 [20:47:28] (03CR) 10Ori.livneh: [C: 032 V: 032] Follow-up to I02673456f [operations/puppet] - 10https://gerrit.wikimedia.org/r/127149 (owner: 10Ori.livneh) [20:50:24] (03PS1) 10BBlack: fix recursive dns on lvs100[25] [operations/puppet] - 10https://gerrit.wikimedia.org/r/127150 [20:50:32] paravoid: ^ ugly, but would work? 
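The gotcha paravoid diagnosed above (resolv.conf listing the dns-rec-lb service IP, which lvs1005 holds on loopback without running a recursor) is the sort of thing a tiny probe catches: query each configured nameserver directly and see whether it answers. A sketch using dnspython, which is an assumption — it may well not be installed on these hosts:

```python
# Sketch of a sanity check for the situation described above: read /etc/resolv.conf and
# query each listed nameserver directly, so a resolver IP that the local box holds on
# loopback without actually running a recursor shows up as a timeout rather than
# silently breaking pybal. Uses dnspython, which is an assumption (pip install dnspython).
import dns.resolver

def nameservers_from_resolv_conf(path="/etc/resolv.conf"):
    with open(path) as f:
        return [line.split()[1] for line in f if line.startswith("nameserver")]

for ip in nameservers_from_resolv_conf():
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [ip]
    resolver.lifetime = 2                 # seconds before we call it dead
    try:
        resolver.query("wikimedia.org", "A")
        print(ip, "answers")
    except Exception as err:
        print(ip, "does not answer:", err)
```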
[20:50:46] I don't see any existing class/role designator that does that more implicitly [20:51:45] should an if-block rather than case anyways, for some reason I was thinking of more entries for codfw later, but different node stanza anyways [20:51:48] (03CR) 10jenkins-bot: [V: 04-1] fix recursive dns on lvs100[25] [operations/puppet] - 10https://gerrit.wikimedia.org/r/127150 (owner: 10BBlack) [20:52:58] (03PS2) 10BBlack: fix recursive dns on lvs100[25] [operations/puppet] - 10https://gerrit.wikimedia.org/r/127150 [20:53:16] (03CR) 10Ori.livneh: "@akosiaris, fyi" [operations/puppet] - 10https://gerrit.wikimedia.org/r/127149 (owner: 10Ori.livneh) [20:54:09] (03CR) 10jenkins-bot: [V: 04-1] fix recursive dns on lvs100[25] [operations/puppet] - 10https://gerrit.wikimedia.org/r/127150 (owner: 10BBlack) [21:00:43] (03CR) 10Chad: [C: 031] "Fire away when you're ready." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126804 (owner: 10Reedy) [21:05:54] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:06:54] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 484993 bytes in 9.259 second response time [21:17:54] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:20:45] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [21:25:24] (03PS16) 10BryanDavis: [WIP] Configure scap master and clients in beta [operations/puppet] - 10https://gerrit.wikimedia.org/r/123674 [21:26:54] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 486449 bytes in 9.490 second response time [21:29:32] (03PS17) 10BryanDavis: [WIP] Configure scap master and clients in beta [operations/puppet] - 10https://gerrit.wikimedia.org/r/123674 [21:41:48] Somebody with server access here? [21:42:00] sjoerddebruin: what's the issue/question? [21:42:16] Some user needs his watchlist emptied. [21:42:35] 160,000 pages... ;/ [21:43:20] ha! [21:43:43] ? :) [21:44:10] sjoerddebruin: shell bug, probably then [21:44:28] Well, it's good to know that the old "10,000 page limit" is utterly incorrect now. (I'm at 8,900, and was trying to keep it below 10k) [21:45:20] Or would this work? https://www.mediawiki.org/wiki/Manual:Watchlist#Clearing_the_watchlist [21:46:27] might want to try it [21:47:02] greg-g: It doesn't work. User is going to file a bug. [21:47:10] kk [21:50:03] cmjohnson1, do you know anything about tantalum; specifically who set it up and when? 
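For the watchlist request above: when Special:EditWatchlist's clear option falls over at that size, the underlying operation a shell user would eventually perform amounts to a single delete against MediaWiki's watchlist table, keyed by wl_user. A hedged sketch only — host, wiki and user id are placeholders, and on the real cluster this would go through the normal shell/maintenance process rather than ad hoc SQL:

```python
# Hedged sketch of the underlying operation for the watchlist request above: MediaWiki
# keeps watched titles in the `watchlist` table keyed by wl_user, so emptying one user's
# watchlist is a single DELETE. Host, wiki and user id are placeholders; on the real
# cluster this would go through the normal shell/maintenance process, not ad hoc SQL.
import pymysql  # assumption: any MySQL client library would do

conn = pymysql.connect(host="db-placeholder", db="examplewiki",
                       read_default_file="~/.my.cnf")
user_id = 12345  # placeholder for the requesting user's user_id

with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM watchlist WHERE wl_user = %s", (user_id,))
    print("rows to remove:", cur.fetchone()[0])
    cur.execute("DELETE FROM watchlist WHERE wl_user = %s", (user_id,))
conn.commit()
```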
[21:51:54] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:52:26] (03CR) 10Chad: [C: 031] remove admins::restricted from lucene role [operations/puppet] - 10https://gerrit.wikimedia.org/r/126939 (owner: 10Dzahn) [21:52:54] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 491357 bytes in 9.681 second response time [21:52:57] (03CR) 10Chad: [C: 031] remove admins::restricted from terbium,fluorine [operations/puppet] - 10https://gerrit.wikimedia.org/r/126941 (owner: 10Dzahn) [21:58:54] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:00:00] greg-g: https://bugzilla.wikimedia.org/show_bug.cgi?id=64074 [22:00:54] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 489945 bytes in 9.821 second response time [22:01:54] mwalker: sometime around early march and I think robh set it up [22:03:05] !log updated default labs precise image (heartbleed fix) [22:03:11] Logged the message, Master [22:03:24] RobH, RyanLane and I are going back and forth on the labs list currently about getting an ubuntu 14.04 image in labs. apparently he experienced some puppet / ruby difficulties when he tried a couple of weeks ago. if you could share some insight that would be wonderful [22:03:35] mwalker: uhh [22:03:40] i have no idea what you are talking about ;] [22:03:54] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:03:58] hehe; slander! [22:04:04] PROBLEM - Puppet freshness on db1056 is CRITICAL: Last successful Puppet run was Wed 16 Apr 2014 06:54:47 AM UTC [22:04:08] ok; I'll have to ask jeff tomorrow when he gets back then [22:04:10] mwalker: that's going to be pretty much me or ryan. [22:04:19] the ;] is not a 'im joking, i know' but a sorry, but i have no clue. [22:04:28] And as far as I know Ryan is still working on it so I'm leaving it to him for now. [22:05:02] andrewbogott, yepyep, I was just hoping the person who setup tantalum could provide some information on what they had to do to make tantalum work [22:05:39] because 14.04 can clearly be installed and puppetized; but there might be some pixy dust that needs to be applied [22:06:13] afaik it just worked. [22:06:20] but that doesnt mean i installed that one [22:06:30] i cannot keep track of what i put on what server =] [22:06:44] RobH, have you put any 14.04 images out? [22:06:54] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 489946 bytes in 9.747 second response time [22:08:19] it seems that osmium (wont install due to disk controller trusty driver), tantalum, and copper have trusty set in their dhcpd lease entries [22:08:41] no clue what copper does, no motd entry [22:08:59] mwalker: so it seems we have some, do i personally recall installing it? not really except on osmium where it failed to work recently [22:09:18] its all on RT tickets so I dont have to recall them anyhow ;] [22:10:06] i see https://rt.wikimedia.org/Ticket/Display.html?id=5917 [22:10:15] so it looks like faidon is using server copper which runs trusty [22:10:35] I think Jeff Green also recently had to do some trusty install work, but I am uncertain if he resolved anything with it. 
[22:10:46] (03PS3) 10BBlack: fix recursive dns on lvs100[25] [operations/puppet] - 10https://gerrit.wikimedia.org/r/127150 [22:10:54] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:10:55] bblack: ytmnd =] [22:11:11] (i broke things earlier cuz of that, heh) [22:11:39] the fix is less-than-ideal, but I guess it's better than leaving puppet disabled + resolv.conf hacked [22:13:02] (03CR) 10BBlack: [C: 032 V: 032] fix recursive dns on lvs100[25] [operations/puppet] - 10https://gerrit.wikimedia.org/r/127150 (owner: 10BBlack) [22:18:54] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 487192 bytes in 9.937 second response time [22:21:45] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [22:23:42] (03PS2) 10MaxSem: Kill all vestiges of $wgMFRemovableClasses [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126826 [22:23:44] (03PS1) 10MaxSem: Normalize TextExtracts config handling [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127170 [22:23:54] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:24:54] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 487196 bytes in 9.699 second response time [22:29:34] (03PS1) 10BBlack: attempt to fix ordering issue w/ nameservers_prefix [operations/puppet] - 10https://gerrit.wikimedia.org/r/127171 [22:30:36] (03CR) 10BBlack: [C: 032 V: 032] attempt to fix ordering issue w/ nameservers_prefix [operations/puppet] - 10https://gerrit.wikimedia.org/r/127171 (owner: 10BBlack) [22:57:54] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:58:54] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 478705 bytes in 9.951 second response time [23:01:00] RoanKattouw_away, mwalker, ebernhardson: I can do SWAT today [23:01:40] ok [23:01:54] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 157.866669 [23:01:54] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:02:54] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 477223 bytes in 9.918 second response time [23:03:04] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 122.333336 [23:05:54] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:08:59] (03PS1) 10Legoktm: Enable GlobalCssJs on testwiki & test2wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127178 [23:10:40] (03PS1) 10Ori.livneh: Add GlobalCssJs to extension-list [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127180 [23:13:05] (03CR) 10MaxSem: [C: 031] Create a FeaturedFeed for the Tech News bulletin [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124272 (owner: 10Odder) [23:14:53] !log ori synchronized php-1.23wmf22/extensions/ApiSandbox 'I9a56b2c5a: Update ApiSandbox' [23:14:58] Logged the message, Master [23:15:38] (03PS7) 10Ori.livneh: Create a FeaturedFeed for the Tech News bulletin [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124272 (owner: 10Odder) [23:15:41] (03CR) 10Legoktm: [C: 031] Add GlobalCssJs to extension-list [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127180 (owner: 
10Ori.livneh) [23:15:43] (03CR) 10Ori.livneh: [C: 032] Create a FeaturedFeed for the Tech News bulletin [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124272 (owner: 10Odder) [23:15:51] (03Merged) 10jenkins-bot: Create a FeaturedFeed for the Tech News bulletin [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124272 (owner: 10Odder) [23:15:54] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 474665 bytes in 9.721 second response time [23:16:56] Hmmm. ori: Thanks for the GlobalCssJs changes. I thought that had already been done in order for Beta Labs to have the extension. [23:16:56] !log ori updated /a/common to {{Gerrit|I1795c70d1}}: Create a FeaturedFeed for the Tech News bulletin [23:16:59] Guess not. [23:17:02] Logged the message, Master [23:17:19] Gloria: labs has different files than prod [23:17:20] Gloria: hm? [23:17:26] extension-list-labs versus extension-list, etc. [23:17:29] Ah. [23:17:50] I'm told there's some new requirement that extensions go to Beta Labs for a week first now. [23:17:56] !log ori synchronized wmf-config/FeaturedFeedsWMF.php 'I1795c70d1: Create a FeaturedFeed for the Tech News bulletin (1/2)' [23:17:58] I thought the point was to make this more streamlined. [23:18:02] Logged the message, Master [23:18:05] !log ori synchronized wmf-config/InitialiseSettings.php 'I1795c70d1: Create a FeaturedFeed for the Tech News bulletin (2/2)' [23:18:10] Logged the message, Master [23:18:11] Gloria: they should go to beta and think about what they've done [23:18:16] Heh. [23:20:17] MaxSem: does depend on the updates to the MobileFrontend submodule? [23:20:37] ori, no - they;re independent [23:20:45] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [23:20:46] k, just checking [23:21:15] (03PS3) 10Ori.livneh: Kill all vestiges of $wgMFRemovableClasses [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126826 (owner: 10MaxSem) [23:21:26] (03CR) 10Ori.livneh: [C: 032] Kill all vestiges of $wgMFRemovableClasses [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126826 (owner: 10MaxSem) [23:21:47] (03Merged) 10jenkins-bot: Kill all vestiges of $wgMFRemovableClasses [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126826 (owner: 10MaxSem) [23:21:54] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [23:23:04] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [23:23:39] !log ori updated /a/common to {{Gerrit|I7841f74b0}}: Kill all vestiges of $wgMFRemovableClasses [23:23:44] Logged the message, Master [23:24:29] !log ori synchronized wmf-config/mobile.php 'I7841f74b0: Kill all vestiges of $wgMFRemovableClasses (1/2)' [23:24:35] Logged the message, Master [23:24:38] !log ori synchronized wmf-config/InitialiseSettings.php 'I7841f74b0: Kill all vestiges of $wgMFRemovableClasses (2/2)' [23:24:44] Logged the message, Master [23:24:57] (03PS2) 10Ori.livneh: Normalize TextExtracts config handling [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127170 (owner: 10MaxSem) [23:25:11] (03CR) 10Ori.livneh: [C: 032] Normalize TextExtracts config handling [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127170 (owner: 10MaxSem) [23:25:59] (03Merged) 10jenkins-bot: Normalize TextExtracts config handling [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127170 (owner: 10MaxSem) [23:26:40] !log ori updated 
/a/common to {{Gerrit|I373df6138}}: Normalize TextExtracts config handling [23:26:47] Logged the message, Master [23:27:22] !log ori synchronized wmf-config/InitialiseSettings.php 'I373df6138: Normalize TextExtracts config handling (1/2)' [23:27:28] Logged the message, Master [23:27:31] !log ori synchronized wmf-config/CommonSettings.php 'I373df6138: Normalize TextExtracts config handling (2/2)' [23:27:36] Logged the message, Master [23:28:22] (03PS2) 10Ori.livneh: Add meta to legalteamwiki import sources [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127126 (owner: 10Jalexander) [23:28:30] (03CR) 10Ori.livneh: [C: 032] Add meta to legalteamwiki import sources [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127126 (owner: 10Jalexander) [23:28:36] (03Merged) 10jenkins-bot: Add meta to legalteamwiki import sources [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127126 (owner: 10Jalexander) [23:28:47] !log ori updated /a/common to {{Gerrit|I52378a4b4}}: Add meta to legalteamwiki import sources [23:28:53] Logged the message, Master [23:28:55] Whoa, exciting SWAT. [23:29:08] thanks much ori [23:29:21] !log ori synchronized wmf-config/InitialiseSettings.php 'I52378a4b4: Add meta to legalteamwiki import sources' [23:29:27] Logged the message, Master [23:32:54] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:34:54] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 474430 bytes in 9.784 second response time [23:35:24] (03PS2) 10Ori.livneh: Add GlobalCssJs to extension-list [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127180 [23:35:43] (03CR) 10Ori.livneh: [C: 032] Add GlobalCssJs to extension-list [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127180 (owner: 10Ori.livneh) [23:36:46] (03Merged) 10jenkins-bot: Add GlobalCssJs to extension-list [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127180 (owner: 10Ori.livneh) [23:37:08] !log ori updated /a/common to {{Gerrit|I2a2abd7f3}}: Add GlobalCssJs to extension-list [23:37:12] Logged the message, Master [23:38:50] !log ori Started scap: Cherry-pick Ibe8e67ebf for MobileFrontend on 1.23wmf22 and 1.24wmf1; add GlobalCssJs extension to 1.24wmf1 [23:38:54] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:38:56] Logged the message, Master [23:39:14] !log ori scap failed: CalledProcessError Command '/usr/local/bin/mw-update-l10n' returned non-zero exit status 1 (duration: 00m 24s) [23:39:20] Logged the message, Master [23:39:26] bd808|BUFFER: fun [23:39:43] Updating ExtensionMessages-1.23wmf22.php...Extension /a/common/php-1.23wmf22/extensions/GlobalCssJs/GlobalCssJs.php doesn't exist [23:39:54] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 472779 bytes in 9.944 second response time [23:45:43] !log ori Started scap: Cherry-pick Ibe8e67ebf for MobileFrontend on 1.23wmf22 and 1.24wmf1; add GlobalCssJs extension to 1.24wmf1 and 1.23wmf22 [23:45:47] Logged the message, Master [23:52:14] (03PS2) 10Legoktm: Enable GlobalCssJs on testwiki & test2wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127178 [23:57:54] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 51.733334 [23:58:24] (03CR) 10Ori.livneh: [C: 031] "LGTM; will sync after the current scap is done." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127178 (owner: 10Legoktm) [23:59:04] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 116.199997 [23:59:34] mwalker|away: can you update the DonationInterface submodule for https://gerrit.wikimedia.org/r/#/c/127123/ ?
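A footnote on the GlobalCssJs deploy above: the first scap (23:38) failed because extension-list is shared across every deployed branch, so the l10n rebuild looked for GlobalCssJs.php under php-1.23wmf22 as well as php-1.24wmf1; the second scap (23:45) added the extension to both branches first. Enabling it per wiki, as in change 127178, then follows the usual wmf-config pattern. The snippet below is only a sketch of that pattern, assuming the conventional wmgUseGlobalCssJs switch name and the default-off values implied by the change title; it is not the merged diff.

```php
<?php
// Sketch of the usual two-step wmf-config pattern, not the actual change.

// InitialiseSettings.php fragment: per-wiki switch, default off, on for the
// two wikis named in Gerrit change 127178.
$wgConf->settings['wmgUseGlobalCssJs'] = array(
	'default'   => false,
	'testwiki'  => true,
	'test2wiki' => true,
);

// CommonSettings.php fragment: only load the extension where the switch is on.
// The file path has to exist in every deployed branch (php-1.23wmf22 and
// php-1.24wmf1 here), because the l10n rebuild in scap reads the shared
// extension-list for each branch -- the error logged at 23:39 above.
if ( $wmgUseGlobalCssJs ) {
	require_once "$IP/extensions/GlobalCssJs/GlobalCssJs.php";
}
```

Keeping the require_once behind the switch means only wikis flagged in InitialiseSettings.php load the extension code at all.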