[00:00:07] RoanKattouw, ^d: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150130T0000). Please do the needful. [00:04:11] PROBLEM - puppet last run on mw1256 is CRITICAL: CRITICAL: Puppet has 1 failures [00:09:02] (03CR) 10Santhosh: [C: 031] cxserver: enable no->nn language pair [puppet] - 10https://gerrit.wikimedia.org/r/186522 (https://phabricator.wikimedia.org/T76674) (owner: 10KartikMistry) [00:10:04] (03PS1) 10Matthias Mullie: Add $wgParsoid... variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187625 [00:16:59] (03PS1) 10Dzahn: decom cp1037,cp1038,cp1039,cp1040 [dns] - 10https://gerrit.wikimedia.org/r/187626 (https://phabricator.wikimedia.org/T87800) [00:22:11] RECOVERY - puppet last run on mw1256 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [00:22:23] 3Wikimedia-Git-or-Gerrit, operations: Remove port 29418 from cloning process - https://phabricator.wikimedia.org/T37611#1002856 (10Dzahn) [01:08:21] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [01:30:34] !log powercycling cp1047 [01:30:42] Logged the message, Master [01:33:31] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [01:35:09] 3ops-eqiad, operations: cp1047 down - https://phabricator.wikimedia.org/T88045#1003104 (10Dzahn) p:5Triage>3Normal [01:36:41] !log cp1047 - DIMM fail -> T88045 [01:36:58] Logged the message, Master [02:14:28] !log l10nupdate Synchronized php-1.25wmf14/cache/l10n: (no message) (duration: 00m 02s) [02:14:41] Logged the message, Master [02:15:35] !log LocalisationUpdate completed (1.25wmf14) at 2015-01-30 02:14:31+00:00 [02:15:44] Logged the message, Master [02:19:41] ACKNOWLEDGEMENT - Host cp1063 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T84809 [02:20:52] ACKNOWLEDGEMENT - Host cp1047 is DOWN: 
PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T88045 [02:23:53] !log l10nupdate Synchronized php-1.25wmf15/cache/l10n: (no message) (duration: 00m 03s) [02:24:07] Logged the message, Master [02:25:00] !log LocalisationUpdate completed (1.25wmf15) at 2015-01-30 02:23:56+00:00 [02:25:15] Logged the message, Master [02:38:21] (03PS1) 10Dzahn: torrus: add http monitoring [puppet] - 10https://gerrit.wikimedia.org/r/187643 (https://phabricator.wikimedia.org/T87817) [02:38:30] 3Multimedia: Create basic puppet role for Sentry - https://phabricator.wikimedia.org/T84956#1003414 (10Tgr) a:5Tgr>3None [02:39:18] 3operations: Icinga check for Torrus - https://phabricator.wikimedia.org/T87817#1003431 (10Dzahn) a:3Dzahn [02:39:37] 3operations: Icinga check for Torrus - https://phabricator.wikimedia.org/T87817#999760 (10Dzahn) p:5Triage>3Low [02:40:51] 3operations: Icinga check for Torrus - https://phabricator.wikimedia.org/T87817#999760 (10Dzahn) this would translate to: ``` @neon:~# /usr/lib/nagios/plugins/check_http -H torrus.wikimedia.org -I 208.80.154.159 -u '/torrus' -s 'Torrus Top: Wikimedia' HTTP OK: HTTP/1.1 200 OK - 2166 bytes in 0.082 second respo... 
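The truncated `check_http` paste above is the whole of the monitoring logic; in Puppet terms it reduces to a single resource. A minimal sketch of what the "torrus: add http monitoring" change plausibly contains (the resource title and check-command plugin name here are assumptions, not the actual patch):

```
# Hypothetical Icinga check for torrus, equivalent to:
#   check_http -H torrus.wikimedia.org -u '/torrus' -s 'Torrus Top: Wikimedia'
monitoring::service { 'torrus-http':
    description   => 'torrus.wikimedia.org HTTP',
    check_command => 'check_http_url_for_string!torrus.wikimedia.org!/torrus!Torrus Top: Wikimedia',
}
```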
[02:41:43] (03CR) 10Dzahn: [C: 032] torrus: add http monitoring [puppet] - 10https://gerrit.wikimedia.org/r/187643 (https://phabricator.wikimedia.org/T87817) (owner: 10Dzahn) [02:49:17] 3Ops-Access-Requests, operations: Give "hoo" sudo access to dataset snapshot hosts - https://phabricator.wikimedia.org/T86808#1003438 (10Dzahn) a:3ArielGlenn [02:55:54] (03PS11) 10Dzahn: Allow "hoo" to sudo into datasets [puppet] - 10https://gerrit.wikimedia.org/r/152724 (https://phabricator.wikimedia.org/T86808) (owner: 10Hoo man) [02:56:52] (03CR) 10jenkins-bot: [V: 04-1] Allow "hoo" to sudo into datasets [puppet] - 10https://gerrit.wikimedia.org/r/152724 (https://phabricator.wikimedia.org/T86808) (owner: 10Hoo man) [02:58:23] (03PS12) 10Dzahn: Allow "hoo" to sudo into datasets [puppet] - 10https://gerrit.wikimedia.org/r/152724 (https://phabricator.wikimedia.org/T86808) (owner: 10Hoo man) [02:59:17] (03CR) 10jenkins-bot: [V: 04-1] Allow "hoo" to sudo into datasets [puppet] - 10https://gerrit.wikimedia.org/r/152724 (https://phabricator.wikimedia.org/T86808) (owner: 10Hoo man) [03:01:55] (03PS13) 10Dzahn: Allow "hoo" to sudo into datasets [puppet] - 10https://gerrit.wikimedia.org/r/152724 (https://phabricator.wikimedia.org/T86808) (owner: 10Hoo man) [03:05:45] (03PS14) 10Dzahn: Allow "hoo" to sudo into datasets [puppet] - 10https://gerrit.wikimedia.org/r/152724 (https://phabricator.wikimedia.org/T86808) (owner: 10Hoo man) [03:07:44] (03PS15) 10Dzahn: Allow "hoo" to sudo into datasets [puppet] - 10https://gerrit.wikimedia.org/r/152724 (https://phabricator.wikimedia.org/T86808) (owner: 10Hoo man) [03:08:23] (03CR) 10Dzahn: [C: 031] "PS1 was uploaded on 2014-08-07 ! 
_please_ unblock and no more rebasing :p" [puppet] - 10https://gerrit.wikimedia.org/r/152724 (https://phabricator.wikimedia.org/T86808) (owner: 10Hoo man) [03:10:02] 3Ops-Access-Requests, operations: Give "hoo" sudo access to dataset snapshot hosts - https://phabricator.wikimedia.org/T86808#1003461 (10Dzahn) fixed https://gerrit.wikimedia.org/r/#/c/152724/ needed a bunch of rebasing. PS1 was uploaded 2014-08-07 please unblock this now [03:13:06] 3operations: Icinga check for Torrus - https://phabricator.wikimedia.org/T87817#1003468 (10Dzahn) @jgage https://gerrit.wikimedia.org/r/#/c/187643/ was supposed to resolve it, but it didn't show up in Icinga yet even though I ran puppet on neon.. will have to debug why [03:19:09] 3operations: Torrus is broken - https://phabricator.wikimedia.org/T87815#1003475 (10Dzahn) [03:19:10] 3operations: Icinga check for Torrus - https://phabricator.wikimedia.org/T87817#1003473 (10Dzahn) 5Open>3Resolved ah, here it is, just needed more patience: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=netmon1001&service=torrus.wikimedia.org+HTTP https://icinga.wikimedia.org/cgi-bin/...
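Dzahn's debugging loop here (merge the patch, run puppet on neon, wait for the check to surface) can be sketched as a few commands on the monitoring host; a hedged outline assuming a stock Icinga 1.x layout, not the exact procedure used:

```
# On the Icinga host (neon): apply the catalog so the new service is
# generated, verify the resulting config parses, then reload Icinga.
puppet agent --test
icinga -v /etc/icinga/icinga.cfg
service icinga reload
```

As the log shows, no manual step was actually needed; the check "just needed more patience" for the regular puppet cycle.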
[03:19:28] 3operations: Icinga check for Torrus - https://phabricator.wikimedia.org/T87817#1003476 (10Dzahn) [03:25:09] PROBLEM - puppet last run on ms-fe2004 is CRITICAL: CRITICAL: puppet fail [03:31:21] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: Puppet has 1 failures [03:34:09] PROBLEM - Disk space on labstore1001 is CRITICAL: DISK CRITICAL - free space: /var 3002 MB (3% inode=99%): [03:43:50] RECOVERY - puppet last run on ms-fe2004 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [03:49:10] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [04:00:56] (03PS1) 10Mattrobenolt: Fix bad symlinks for kafka-common [debs/kafka] - 10https://gerrit.wikimedia.org/r/187648 [04:21:15] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Jan 30 04:20:12 UTC 2015 (duration 20m 11s) [04:21:20] Logged the message, Master [04:38:50] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [04:51:20] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [06:19:00] PROBLEM - puppet last run on mw1128 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:01] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:10] PROBLEM - puppet last run on analytics1030 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:20] PROBLEM - puppet last run on mw1251 is CRITICAL: CRITICAL: puppet fail [06:28:20] PROBLEM - puppet last run on mw1039 is CRITICAL: CRITICAL: puppet fail [06:28:49] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:40] PROBLEM - puppet last run on logstash1002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:59] PROBLEM - puppet last run on analytics1010 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:00] PROBLEM - puppet last run on db2040 is CRITICAL: CRITICAL: Puppet has 1 
failures [06:30:00] PROBLEM - puppet last run on db2018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:10] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:10] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:10] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:20] PROBLEM - puppet last run on elastic1030 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:29] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:50] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:50] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:19] PROBLEM - puppet last run on mw1054 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:19] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:29] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:30] PROBLEM - puppet last run on mw1129 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:30] PROBLEM - puppet last run on mw1011 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:39] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:40] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:10] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:10] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:10] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:40] RECOVERY - puppet last run on mw1128 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:45:29] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [06:45:40] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is 
currently enabled, last run 44 seconds ago with 0 failures [06:45:49] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:45:50] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [06:45:50] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [06:45:59] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [06:46:00] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:46:09] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [06:46:19] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [06:46:20] RECOVERY - puppet last run on logstash1002 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:46:29] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [06:46:30] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [06:46:39] RECOVERY - puppet last run on analytics1010 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:46:40] RECOVERY - puppet last run on db2040 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:46:50] RECOVERY - puppet last run on db2018 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [06:46:50] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [06:46:50] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures 
[06:47:00] RECOVERY - puppet last run on elastic1030 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:47:00] RECOVERY - puppet last run on mw1054 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [06:47:00] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:47:00] RECOVERY - puppet last run on mw1251 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [06:47:10] RECOVERY - puppet last run on mw1039 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [06:47:10] RECOVERY - puppet last run on mw1129 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [06:47:10] RECOVERY - puppet last run on mw1011 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:47:50] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:48:29] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [07:01:55] 3operations: Sometimes error sec_error_ocsp_old_response - https://phabricator.wikimedia.org/T88087#1003802 (10Sunpriat) 3NEW [08:22:29] 3operations: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1003895 (10Chmarkine) The certificate for svn.wikimedia.org will expire in about one day. There are some links to https://svn.wikimedia.org on enwiki that will be affected by the expiration. https://en.wikipedia.org/w/... [08:49:44] greetings [09:27:06] 3operations: Monitor Netapps - https://phabricator.wikimedia.org/T87836#1003943 (10faidon) [09:27:07] 3operations: Create Icinga alerts for Netapp health - https://phabricator.wikimedia.org/T87839#1003940 (10faidon) 5Open>3declined a:3faidon See parent task for more. 
[09:27:56] 3operations: Monitor Netapps - https://phabricator.wikimedia.org/T87836#1000337 (10faidon) [09:27:58] 3operations: Retire Torrus - https://phabricator.wikimedia.org/T87840#1003951 (10faidon) [09:27:59] 3operations: Graph Netapp SNMP stats with LIbreNMS - https://phabricator.wikimedia.org/T87837#1003947 (10faidon) 5Open>3declined a:3faidon See parent task for more. Moreover, assuming LibreNMS would be a good candidate was the wrong call; the parent task alone would be enough to describe the whole of this... [09:42:40] 3operations, ops-ulsfo: fan reversed on asw1-ulsfo - https://phabricator.wikimedia.org/T83978#1003963 (10faidon) @Gage, I know you went on-site on Friday; can you update this with the status? [09:55:04] (03PS1) 10Filippo Giunchedi: introduce graphite raid10-lvm configuration [puppet] - 10https://gerrit.wikimedia.org/r/187663 (https://phabricator.wikimedia.org/T85909) [09:55:06] (03PS1) 10Filippo Giunchedi: provision graphite[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/187664 (https://phabricator.wikimedia.org/T85909) [10:02:50] RECOVERY - Disk space on labstore1001 is OK: DISK OK [10:04:31] !log labstore1001: setting /proc/sys/sunrpc/{nfs,rpc}_debug to 0; rm /var/log/{kern.log,syslog.1,syslog} [10:04:41] Logged the message, Master [10:26:40] PROBLEM - puppet last run on es2001 is CRITICAL: CRITICAL: puppet fail [10:28:06] Jumping on labstore1001 now [10:28:09] hi! [10:28:21] thanks :) [10:29:43] Sorry about having been mostly away - I've slept about 40 out of the past 48 hours while the meds finally got rid of the crap in my lungs. 
:-( [10:30:07] oops [10:30:20] <_joe_> Coren: now it's just extremely slow [10:30:24] Yeah, well, the good part is I can actually breathe [10:30:31] <_joe_> from beta [10:30:43] <_joe_> Coren: :/ [10:31:03] (03PS2) 10Filippo Giunchedi: provision graphite[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/187664 (https://phabricator.wikimedia.org/T85909) [10:31:04] _joe_: Yeah, I'm seeing it - it's slow from everywhere. Something is hammering on it like crazy and I'm tracking down what [10:31:05] (03PS2) 10Filippo Giunchedi: introduce graphite raid10-lvm configuration [puppet] - 10https://gerrit.wikimedia.org/r/187663 (https://phabricator.wikimedia.org/T85909) [10:32:59] RECOVERY - puppet last run on es2001 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [10:33:20] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] introduce graphite raid10-lvm configuration [puppet] - 10https://gerrit.wikimedia.org/r/187663 (https://phabricator.wikimedia.org/T85909) (owner: 10Filippo Giunchedi) [10:34:00] (03PS16) 10Giuseppe Lavagetto: Strongswan: IPsec Puppet module [puppet] - 10https://gerrit.wikimedia.org/r/181742 (owner: 10Gage) [10:41:48] paravoid: I might need you to take a peek switch-side - something odd is going on that might be network-related: NFS traffic stalls entirely for 4-5 secs, then 4-5 secs for near-saturation burst, then stalls again. [10:42:09] paravoid: And the traffic seems to arrive from only one half of the bonded ports [10:42:49] did we ever fix bonding? I think we didn't? [10:45:19] paravoid: Well, labstore1001 seems to think it's working; it sees the bonding with two slaves [10:46:58] I distinctly remember having issues with bonding in the past and giving up since there wasn't a need for it [10:47:11] both me and mark had tried [10:47:21] the switch has a single-port bond configured [10:47:29] so that should be fine [10:47:55] Ah, that'd explain why everything is coming over one side only. 
:-) [10:48:06] yeah we've discussed this before [10:48:16] <_joe_> also, it's probably something misconfigured since the reboot [10:48:31] <_joe_> and now I get why there was that fuckton of messages [10:48:38] we didn't reboot [10:48:42] lemme have a look [10:49:05] diamond [10:49:06] 100% cpu [10:49:10] [pid 3686] stat("/exp/dumps", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 [10:49:13] [pid 3686] stat("/exp/backups", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 [10:49:16] in a loop [10:49:29] ah now stopped [10:49:58] md122? md127? [10:50:48] disks are saturated [10:50:55] esp. sdbg [10:51:05] Yeah, I'm trying to find the root cause. [10:51:11] dm0 is saturated completely [10:51:14] so, not a network issue [10:51:46] paravoid: No, chances are the bursty traffic is a symptom of the disk being bursty [10:51:55] paravoid: And not the cause as I first suspected [10:52:04] yup [10:54:28] <_joe_> I'm off for now, gonna grab some food + take a nap [10:54:56] The md*_raid6 processes are hard at it - I'd have thought it was doing raid validation but mdstat says not. Huh. [11:00:08] ttyl [11:00:51] <_joe_> paravoid: send me a message when you're at the hotel [11:01:10] <_joe_> not sure I'll be awake :) [11:01:44] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] provision graphite[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/187664 (https://phabricator.wikimedia.org/T85909) (owner: 10Filippo Giunchedi) [11:02:53] paravoid: Hm. Unless you changed something, whatever was hammering on the disks stopped hammering on the disks. [11:10:10] 3Parsoid, operations, Parsoid-Team: Multiple Parsoid crashes due to "Parse Error" on Russian Wikinews (WNRU) - https://phabricator.wikimedia.org/T88100#1004037 (10Kelson) 3NEW [11:23:16] PROBLEM - uWSGI web apps on graphite1001 is CRITICAL: CRITICAL: Not all configured uWSGI apps are running. 
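The bonding state Coren and paravoid discuss above ("labstore1001 ... sees the bonding with two slaves") is read from `/proc/net/bonding/bond0`. A self-contained sketch using invented sample text (not labstore1001's real output):

```shell
# Sample of what /proc/net/bonding/bond0 looks like when the host side
# believes two slaves are attached (invented text for illustration).
bond_status='Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Slave Interface: eth0
MII Status: up
Slave Interface: eth1
MII Status: up'

# Count the slaves the kernel thinks are in the bond. Reading 2 here
# while the switch only has a single-port bond configured is exactly
# the mismatch that makes traffic arrive on one half of the ports.
slaves=$(printf '%s\n' "$bond_status" | grep -c '^Slave Interface:')
echo "$slaves"
```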
[11:24:06] PROBLEM - gdash.wikimedia.org on graphite1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.009 second response time [11:24:16] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.010 second response time [11:24:25] PROBLEM - puppet last run on graphite1001 is CRITICAL: CRITICAL: Puppet has 1 failures [11:26:40] paravoid: I had two outlier instances, ima keep an eye on both of them see if they go cray cray [11:27:25] I *was* about to say that things seemed back to normal, but there it goes again. [11:34:04] 3ops-codfw, operations: graphite2001 stuck at boot with "scanning for devices" - https://phabricator.wikimedia.org/T88101#1004046 (10fgiunchedi) 3NEW [11:46:26] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [12:02:56] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.166 second response time [12:28:45] 3Labs: Puppet logs should be timestamped in a human-readable way - https://phabricator.wikimedia.org/T88108#1004161 (10scfc) 3NEW [14:50:09] 3operations: Decommission svn.wikimedia.org server (import SVN into Phabricator) - https://phabricator.wikimedia.org/T86655#1004254 (10Dzahn) [14:50:30] 3operations: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1004256 (10Dzahn) >>! In T73156#1003895, @Chmarkine wrote: > The certificate for svn.wikimedia.org will expire in about one day. 
This is T86655 (and historically T24596) [14:52:18] 3operations: Decommission svn.wikimedia.org server (import SVN into Phabricator) - https://phabricator.wikimedia.org/T86655#1004260 (10Dzahn) a:3Chad [14:53:15] 3operations: Decommission svn.wikimedia.org server (import SVN into Phabricator) - https://phabricator.wikimedia.org/T86655#973496 (10Dzahn) [14:54:28] 3operations: Sometimes error sec_error_ocsp_old_response - https://phabricator.wikimedia.org/T88087#1004263 (10Dzahn) [14:54:35] ori: I'm trying to run gdash on ruby 1.9 on graphite1001 though it doesn't seem to honor $: << and doesn't find gdash, did you see this before? [14:57:34] 3operations: Decommission svn.wikimedia.org server (import SVN into Phabricator) - https://phabricator.wikimedia.org/T86655#1004265 (10mark) p:5High>3Normal We're making this a High priority just because the HTTPS cert expires? I don't know, do we care that much for a service essentially fallen in disuse? Su... [15:02:39] 3operations: Sometimes error sec_error_ocsp_old_response - https://phabricator.wikimedia.org/T88087#1004267 (10BBlack) [15:12:02] 3operations: Decommission svn.wikimedia.org server (import SVN into Phabricator) - https://phabricator.wikimedia.org/T86655#1004272 (10chasemp) FWIW I cherry-picked the relevant upstream commit on phab-01 for my man @Chad to test if the SVN issue is actually resolved. This week being crazy I think that still ne... [15:55:11] (03CR) 10BBlack: [C: 031] decom cp1037,cp1038,cp1039,cp1040 [dns] - 10https://gerrit.wikimedia.org/r/187626 (https://phabricator.wikimedia.org/T87800) (owner: 10Dzahn) [15:56:00] (03CR) 10BBlack: [C: 031] decom cp1037,cp1038,cp1039,cp1040 [puppet] - 10https://gerrit.wikimedia.org/r/187615 (https://phabricator.wikimedia.org/T87800) (owner: 10Dzahn) [15:56:48] 3operations: decom cp1037,cp1038,cp1039,cp1040 - https://phabricator.wikimedia.org/T87800#1004335 (10BBlack) >>! 
In T87800#1002742, @Dzahn wrote: >> "no longer have a puppet cache role" > > this looks like they still do: > > https://gerrit.wikimedia.org/r/#/c/187615/1/manifests/site.pp Sorry, I should have sa... [15:57:44] (03PS1) 10Filippo Giunchedi: graphite: explicit install python-twisted-core [puppet] - 10https://gerrit.wikimedia.org/r/187683 (https://phabricator.wikimedia.org/T85909) [16:01:32] (03PS1) 10BBlack: disable compact_memory on jessie T83809 [puppet] - 10https://gerrit.wikimedia.org/r/187684 [16:02:22] (03CR) 10BBlack: [C: 032] disable compact_memory on jessie T83809 [puppet] - 10https://gerrit.wikimedia.org/r/187684 (owner: 10BBlack) [16:13:05] PROBLEM - puppet last run on amssq38 is CRITICAL: CRITICAL: puppet fail [16:13:40] ^ generic puppetmaster http error [16:15:40] jgage: torrus now monitored https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=torrus [16:16:17] (03PS2) 10Giuseppe Lavagetto: mediawiki: allow using a different web user than apache [puppet] - 10https://gerrit.wikimedia.org/r/187259 [16:16:19] (03PS1) 10Giuseppe Lavagetto: labstore: do not explicitly declare the apache user existence [puppet] - 10https://gerrit.wikimedia.org/r/187686 [16:16:21] (03PS1) 10Giuseppe Lavagetto: maintenance: allow choosing the web user [puppet] - 10https://gerrit.wikimedia.org/r/187687 [16:16:23] (03PS1) 10Giuseppe Lavagetto: beta: allow defining the web user. 
[puppet] - 10https://gerrit.wikimedia.org/r/187688 [16:18:28] !log restarting frontend varnishes to apply increased cache sizes from https://gerrit.wikimedia.org/r/#/c/186816/ over the next ~9H [16:18:36] Logged the message, Master [16:26:48] (03PS1) 10Filippo Giunchedi: graphite: format /var/lib/carbon [puppet] - 10https://gerrit.wikimedia.org/r/187690 (https://phabricator.wikimedia.org/T85909) [16:27:38] (03PS16) 10Dzahn: Allow "hoo" to sudo into datasets [puppet] - 10https://gerrit.wikimedia.org/r/152724 (https://phabricator.wikimedia.org/T86808) (owner: 10Hoo man) [16:28:11] (03PS1) 10Chad: Gerrit: limit max object size to 50M [puppet] - 10https://gerrit.wikimedia.org/r/187691 [16:29:19] 3operations, ops-ulsfo: fan reversed on asw1-ulsfo - https://phabricator.wikimedia.org/T83978#1004360 (10Gage) Thanks for the reminder. Onsite visit was Weds, 2015-01-28. I updated the procurement ticket in https://rt.wikimedia.org/Ticket/Display.html?id=8596 but forgot to also update Phab: Wrong part was recei... [16:33:16] RECOVERY - puppet last run on amssq38 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [16:34:10] 3operations: Cannot use dsh-based restart of parsoid from tin anymore - https://phabricator.wikimedia.org/T87803#1004362 (10GWicke) @akosiaris, can we allow connections from tin, so that we can deploy from the deploy host? 
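The "Gerrit: limit max object size to 50M" change under review in this stretch of the log maps onto one real Gerrit setting, `receive.maxObjectSizeLimit`; the fragment below is a sketch of the effective `gerrit.config` stanza, not the actual patch:

```
[receive]
  maxObjectSizeLimit = 50m
```

This caps individual git objects at push time (which also applies to new repo imports); it says nothing about total repository size.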
[16:35:05] PROBLEM - DPKG on cp4002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:35:06] PROBLEM - DPKG on cp4001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:35:06] PROBLEM - DPKG on cp4003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:35:12] <_joe_> mh [16:35:15] PROBLEM - DPKG on cp4004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:36:06] RECOVERY - DPKG on cp4001 is OK: All packages OK [16:36:06] RECOVERY - DPKG on cp4002 is OK: All packages OK [16:36:06] (03CR) 10Rush: [C: 031] "seems good to me assuming this allows direct push for new repo importing etc :)" [puppet] - 10https://gerrit.wikimedia.org/r/187691 (owner: 10Chad) [16:36:16] RECOVERY - DPKG on cp4003 is OK: All packages OK [16:36:16] RECOVERY - DPKG on cp4004 is OK: All packages OK [16:37:47] 3operations: Cannot use dsh-based restart of parsoid from tin anymore - https://phabricator.wikimedia.org/T87803#1004363 (10akosiaris) Sure we can but isn't parsoid deployed via trebuchet which does not use SSH at all ? [16:38:45] PROBLEM - DPKG on cp3020 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:39:01] (03CR) 10Chad: "It'll also affect new repo imports but why would any repo have a 50M object? For objects that large you should be using git-fat or somethi" [puppet] - 10https://gerrit.wikimedia.org/r/187691 (owner: 10Chad) [16:39:40] (03CR) 10Rush: "yes, I was thinking total repo size not object size :) You are most correct I think." 
[puppet] - 10https://gerrit.wikimedia.org/r/187691 (owner: 10Chad) [16:48:07] !log expect more icinga "CRITICAL: DPKG CRITICAL" on cache nodes for a while; applying backlog of upstream pkg updates slowly to all [16:48:12] Logged the message, Master [17:01:41] (03CR) 10Dzahn: [C: 031] "reasonable" [puppet] - 10https://gerrit.wikimedia.org/r/187691 (owner: 10Chad) [17:06:12] <^d> mutante: We could go ahead and merge that I guess [17:06:22] <^d> It'll restart gerrit but there's never a good time to do that [17:06:33] <^d> Now's as bad as any :p [17:11:03] hey, it's a friday, who cares?! [17:11:12] sarcasm hiding truth [17:12:18] we need to do /something/ on fridays [17:12:32] or I can go on happy friday afternoon now if you prefer ;-p [17:12:38] drinks! [17:13:27] <^d> Gerrit's slow enough right now, good/bad time as any [17:13:33] <^d> If anyone's feeling bold enough to press +2 [17:13:53] only one thing in the zuul pipeline: https://integration.wikimedia.org/zuul/ [17:14:06] (03CR) 10Mark Bergsma: [C: 032] Gerrit: limit max object size to 50M [puppet] - 10https://gerrit.wikimedia.org/r/187691 (owner: 10Chad) [17:14:10] :) [17:14:46] <^d> The puppet run on ytterbium should restart gerrit for us [17:15:25] do you want me to run it manually? [17:15:37] <^d> I'm already logged in, I can do it [17:15:40] k [17:16:07] <^d> !log running puppet on ytterbium, gerrit shall restart [17:16:14] Logged the message, Master [17:16:56] <^d> Ok, and we're back. [17:17:01] <^d> Thanks for the merge mark [17:19:14] w00t, didn't lose any build status reports [17:23:35] 3Parsoid, operations, Parsoid-Team: Multiple Parsoid crashes due to "Parse Error" on Russian Wikinews (WNRU) - https://phabricator.wikimedia.org/T88100#1004398 (10Arlolra) a:3Arlolra [17:32:39] 3Parsoid, operations, Parsoid-Team: Multiple Parsoid crashes due to "Parse Error" on Russian Wikinews (WNRU) - https://phabricator.wikimedia.org/T88100#1004415 (10Arlolra) @Kelson Yes, this is related to https usage. 
Same for uzwiki. Fixing ... [17:38:41] !log andyrussg Synchronized php-1.25wmf15/extensions/CentralNotice: Update CentralNotice (duration: 00m 06s) [17:38:45] Logged the message, Master [17:40:53] (03CR) 10Jforrester: [C: 04-1] Add $wgParsoid... variables (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187625 (owner: 10Matthias Mullie) [17:44:17] (03PS1) 10BBlack: add central systemctl daemon-reload exec [puppet] - 10https://gerrit.wikimedia.org/r/187701 [17:49:44] 3ops-eqiad, operations: Rack and setup graphite1001 - https://phabricator.wikimedia.org/T86939#1004486 (10Cmjohnson) a:5Cmjohnson>3fgiunchedi [17:56:08] 3ops-core: reclaim dysprosium for spare (was: server status) - https://phabricator.wikimedia.org/T83070#1004512 (10BBlack) [17:58:41] (03PS2) 10BBlack: add central systemctl daemon-reload exec [puppet] - 10https://gerrit.wikimedia.org/r/187701 [18:00:17] (03PS3) 10Dzahn: decom cp1037,cp1038,cp1039,cp1040 [puppet] - 10https://gerrit.wikimedia.org/r/187615 (https://phabricator.wikimedia.org/T87800) [18:01:59] (03CR) 10Dzahn: [C: 032] decom cp1037,cp1038,cp1039,cp1040 [puppet] - 10https://gerrit.wikimedia.org/r/187615 (https://phabricator.wikimedia.org/T87800) (owner: 10Dzahn) [18:05:35] PROBLEM - salt-minion processes on cp1039 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [18:06:35] PROBLEM - salt-minion processes on cp1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [18:11:21] 3ops-codfw, operations: graphite2001 stuck at boot with "scanning for devices" - https://phabricator.wikimedia.org/T88101#1004534 (10fgiunchedi) 5Open>3Invalid a:3fgiunchedi this is actually expected, resolving in favor of T84794 (deployment ticket) [18:15:41] PROBLEM - Varnish HTTP bits on cp3020 is CRITICAL: Connection refused [18:15:55] 3operations: migrate graphite to new hardware - https://phabricator.wikimedia.org/T85909#1004553 (10fgiunchedi) currently 
running rsync to transfer metrics changed in the last month to graphite1001, there's ~380k metrics changed in the last 30d and a parallel rsync is churning at ~3/s so ETA for the initial sync... [18:16:36] !log cp1037,cp1038,cp1039,cp1040 - disabled puppet, removed from icinga, revoked certs and salt key etc. decom [18:16:43] Logged the message, Master [18:20:48] Reedy: hi! got a sec? I'm stuck trying to revert my first deploy!!!! [18:21:03] anyone....? ^ [18:21:34] MaxSem: yt? ^ [18:21:42] paging greg-g^ [18:21:44] yup [18:21:49] hmm? [18:21:53] where are you AndyRussG ? [18:21:56] oh [18:22:06] MaxSem: back home in Montreal, and figuratively on tin [18:22:11] kekeke [18:22:27] should have tried this before leaving [18:22:38] It's my first deploy... Yes, true, had to run right after the dev summit tho [18:22:44] AndyRussG: you can click the "revert" button on the relevant gerrit change, then deploy the result of that [18:23:11] it should just make a new change for you [18:23:19] The revert is in the right wmf branch in core, so it seems: [18:23:28] https://gerrit.wikimedia.org/r/#/c/187709/ [18:23:46] okay, what's the problem? [18:23:53] I somehow don't see it on tin [18:23:58] we can do a skype/hangout, btw [18:24:24] Yes you bet, thanks!! [18:24:28] is it under security patches? [18:24:42] because it's a subproject somehow [18:24:49] Hi [18:24:54] Did you just change something, people? [18:25:05] I'm getting a page with no CSS [18:25:13] Or at the very least a broken layout [18:25:23] If you made a change recently I suggest undoing it [18:25:41] yup, the change is there, buried under security patches [18:25:51] (03PS1) 10QChris: Temporarily keep 40 instead of 31 days of webrequest data [puppet] - 10https://gerrit.wikimedia.org/r/187713 [18:25:59] Well it broke [18:26:01] MaxSem: ah maybe that's it then... 
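MaxSem's walkthrough above (click "revert" in gerrit, then deploy the result) comes down to a few steps on tin; a hedged sketch using the branch and path visible in this log, with the caveat that the exact staging layout and commands are assumptions:

```
# On tin: confirm the merged revert is present in the deployed branch's
# checkout, then sync the extension directory to the cluster.
cd /srv/mediawiki-staging/php-1.25wmf15/extensions/CentralNotice
git fetch
git log --oneline ..origin/wmf/1.25wmf15   # the revert should appear here
sync-dir php-1.25wmf15/extensions/CentralNotice 'Revert CentralNotice update'
```

The "buried under security patches" wrinkle is why the revert was hard to spot on tin: locally applied patches sit on top of the fetched branch in the checkout's history.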
[18:26:13] BTW this is not on production generally, just test, mediawiki.org and 2 others [18:26:18] Https version of the site isn't supplying the style information [18:26:28] http version appears to be [18:26:28] works for me [18:26:44] Propagation issue? [18:26:54] Nothing significant is broken, other than a centralnotice campaign running that should show up on mediawiki that doesn't, that's why I'm reverting [18:26:56] I've tried refreshing the cached version [18:27:16] MaxSem: Should I call u via Hangout? [18:27:19] AndyRussG: I am seeing no styling at all [18:27:21] !log rebooting cp3020, something's all wrong there... [18:27:28] Logged the message, Master [18:27:29] AndyRussG, I see no problem [18:27:36] just push your revert now [18:27:39] Qcoder00: what? [18:28:04] The view I have Wikipedia on https right now has no tabs, and no styles [18:28:05] bblack: could what you just did affect users not seeing CSS? [18:28:19] Qcoder00, are you in Europe? [18:28:24] Yes [18:28:26] UK [18:28:29] hehe [18:28:32] PROBLEM - puppet last run on cp3020 is CRITICAL: Connection refused by host [18:28:40] then it's a cp3020 issue likely [18:28:48] Qcoder00: that wasn't me, I didn't push anything to Wikipedia [18:28:48] and so far this only seems to be https related [18:29:04] It's annoying because it breaks Proofread page stuff at Wikisource [18:29:18] MaxSem: Ah I see, right [18:29:32] greg-g: the error itself I'm responding to could have caused css problems for a fraction of users hitting esams, yes [18:29:37] (03PS2) 10QChris: Temporarily keep 40 instead of 31 days of webrequest data [puppet] - 10https://gerrit.wikimedia.org/r/187713 [18:29:51] MaxSem: So I'll push out the security patches too, I guess? Just push out the latest everything in the wmf15 branch? [18:30:00] (to the places it should go) [18:30:04] cp3020 is now disabled in pybal though, so the issue should be no more [18:30:18] cp3020 being a server?
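[editor's note: the Gerrit "revert" button workflow MaxSem walks AndyRussG through above is, under the hood, an ordinary `git revert` that produces a fresh change for review. A minimal local sketch, using a throwaway repo and a made-up filename in place of the real wmf15 branch:]

```shell
# Toy illustration of what Gerrit's "revert" button does: it creates a new
# commit that undoes an earlier one, which is then deployed like any change.
# Throwaway repo; the filename and messages are invented for the example.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.email=x@y -c user.name=x commit -q --allow-empty -m "base"
echo "new banner code" > CentralNotice.php
git add CentralNotice.php
git -c user.email=x@y -c user.name=x commit -q -m "Update CentralNotice"
# The revert button effectively runs:
git -c user.email=x@y -c user.name=x revert --no-edit HEAD
test ! -e CentralNotice.php && echo "revert restored previous state"
```

The resulting "Revert ..." commit is what then shows up on tin for syncing, buried among any local (e.g. security) patches already applied there.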
[18:30:20] OK [18:30:21] AndyRussG, they're already there [18:30:22] Thanks [18:30:31] right [18:30:33] yes [18:30:38] yes, it's a "bits" frontend cache, which specifically handles css and such [18:30:39] I assume a team of server ninjas has been dispatched? [18:30:39] bblack: ty [18:30:42] you just never touch anything you're not actually deploying [18:31:08] Qcoder00: in some form or another, yes, known and being worked on (it should be fixed for you "now") [18:31:16] It is [18:31:22] PROBLEM - Host cp3020 is DOWN: PING CRITICAL - Packet loss = 100% [18:31:42] we had another report on -tech about 500 from bits. user in Europe as well. but confirmed fixed since you disabled it in pybal [18:31:52] greg-g, speaking of first deployments, I would like to teach phuedx to deploy today while he's still here. is a no-op push ok later today? [18:31:52] RECOVERY - Host cp3020 is UP: PING OK - Packet loss = 0%, RTA = 88.15 ms [18:31:52] 3operations, ops-codfw: rack graphite2001 - https://phabricator.wikimedia.org/T86554 (10Cmjohnson) Enabled network port ge-5/0/1 in row B added private1 vlan. Chris [18:32:11] yeah I went to check on that machine when it showed up funny in icinga, and it was having all sorts of crazy issues with hung processes on disk i/o. it may have a hardware failure [18:32:19] we'll see [18:32:32] RECOVERY - Varnish HTTP bits on cp3020 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.179 second response time [18:32:51] bblack: Thanks [18:32:59] heh it booted off its local disks seemingly-ok, but then also: [18:33:00] ata_id[584]: HDIO_GET_IDENTITY failed for '/dev/sda': Invalid argument [18:33:03] ata_id[585]: HDIO_GET_IDENTITY failed for '/dev/sdb': Invalid argument [18:33:06] :) [18:33:16] You have backups? XD [18:33:39] we don't need them, these are stateless machines [18:33:47] but the missing capacity is not ideal [18:33:57] I'm surprised to see an error like this... [18:34:10] I thought you had failover code?
[18:34:21] anyway fixed for now [18:34:23] Thanks [18:34:25] :) [18:34:27] * Qcoder00 out [18:34:31] we do, but some failures are tricky when they're strange and partial like this [18:34:37] the service was running, sort-of :) [18:34:39] !log andyrussg Synchronized php-1.25wmf15/extensions/CentralNotice: Revert update to CentralNotice (duration: 00m 06s) [18:34:43] Logged the message, Master [18:35:01] MaxSem: ^ I think that's it :) [18:35:17] the health checks from the front LBs probably don't check very deeply for correctness, just responsiveness [18:35:52] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [18:36:21] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 [18:36:54] !log removing cp1063 from pybal [18:36:59] Logged the message, Master [18:37:55] Hey opsen, do we have a dashboard or something somewhere where we measure the rate of 503s from bits? [18:39:38] MaxSem: all clear! fixed the issue, thank you for your help!!!! :D [18:39:43] RECOVERY - BGP status on cr1-ulsfo is OK: OK: host 198.35.26.192, sessions up: 10, down: 0, shutdown: 0 [18:39:47] :P [18:40:02] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: puppet fail [18:40:02] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: Puppet has 1 failures [18:40:49] !log initial rsync from tungsten to graphite1001 T85909 [18:40:54] Logged the message, Master [18:42:51] RoanKattouw: well there's https://gdash.wikimedia.org/dashboards/reqerror/ [18:42:57] but that doesn't break out bits [18:45:10] 3RESTBase, operations, Scrum-of-Scrums, Services: RESTbase deployment - https://phabricator.wikimedia.org/T1228#1004684 (10Tnegrin) Hi Gabriel -- do you have a target date for this? 2/15 is Sunday; can we discuss a better day?
thanks, -Toby [18:46:02] (03CR) 10Ottomata: [C: 032 V: 032] Temporarily keep 40 instead of 31 days of webrequest data [puppet] - 10https://gerrit.wikimedia.org/r/187713 (owner: 10QChris) [18:46:12] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: puppet fail [18:47:26] MaxSem: (sorry for delay, in a meeting) yep! [18:49:16] 3ops-codfw, operations: rack mw2135 through mw2215 - https://phabricator.wikimedia.org/T86806#1004689 (10Papaul) Racked the 30 first mw servers [18:50:22] 3ops-codfw, operations: rack and initial configuration of wtp2001-2020 - https://phabricator.wikimedia.org/T86807#1004701 (10Papaul) Racked 20 wtp servers [18:50:58] nice @ racked servers [18:51:26] !log cp1037,cp1038 - shut down [18:51:33] Logged the message, Master [18:52:41] RECOVERY - DPKG on cp3020 is OK: All packages OK [18:53:42] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [18:55:41] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:56:08] !log rebooting cp3020 again (still depooled) [18:56:12] Logged the message, Master [18:57:32] PROBLEM - Host cp3020 is DOWN: PING CRITICAL - Packet loss = 100% [18:57:42] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [18:58:51] RECOVERY - Host cp3020 is UP: PING OK - Packet loss = 0%, RTA = 96.18 ms [18:59:52] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [19:01:11] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [19:02:05] hey _joe_!, qq for you if you are around [19:02:50] !log cp1039, cp1040 - shut down [19:02:58] Logged the message, Master [19:03:53] ^d: can i be added to the deployers group in gerrit? 
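[editor's note: bblack's remark above — that the front LB health checks "don't check very deeply for correctness, just responsiveness" — explains how a sort-of-alive cache like cp3020 can stay pooled. A minimal responsiveness-only probe can be sketched with curl; the URL below is a placeholder, not the actual pybal check:]

```shell
# Responsiveness-only health check, as contrasted above with a correctness
# check: -f fails on HTTP >= 400, but a 200 with a broken or empty body
# still passes, so a half-working backend would not be depooled by this alone.
url="http://127.0.0.1:8080/check"   # placeholder backend URL
if curl -sf -o /dev/null --max-time 2 "$url"; then
  echo "pooled (responsive)"
else
  echo "depooled (no/err response)"
fi
# A deeper check would also validate the body, e.g.:
#   curl -s "$url" | grep -q 'expected content'
```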
[19:04:03] currently trying to self +2 a cherry-pick [19:04:06] (and can't) [19:13:56] 3RESTBase, Scrum-of-Scrums, operations: RESTBase production hardware - https://phabricator.wikimedia.org/T76986#1004747 (10GWicke) @robh, any news? We are aiming for a release before mid-February. VE performance work (top priority project) depends on RESTBase being available ASAP, so moving fast on this would... [19:15:16] 3RESTBase, operations, Scrum-of-Scrums, Services: RESTbase deployment - https://phabricator.wikimedia.org/T1228#1004751 (10GWicke) @tnegrin, we are mostly at the mercy of ops on this (T76986 and T78194). The hardware should be arriving around now, just pinged @RobH about the current status on T76986. [19:15:47] (03CR) 10Dzahn: [C: 032] decom cp1037,cp1038,cp1039,cp1040 [dns] - 10https://gerrit.wikimedia.org/r/187626 (https://phabricator.wikimedia.org/T87800) (owner: 10Dzahn) [19:16:00] AndyRussG: Did you find someone to help you? [19:17:12] 3RESTBase, Scrum-of-Scrums, operations: RESTBase production hardware - https://phabricator.wikimedia.org/T76986#1004754 (10RobH) At the time of order, it was a 2-3 week lead time for shipment. As that has passed and I have no further update, I've pinged our HP VAR via email (just now.) I'll update ticket with... [19:19:04] 3operations: decom cp1037,cp1038,cp1039,cp1040 - https://phabricator.wikimedia.org/T87800#1004772 (10Dzahn) disabled puppet and salt-minions revoked puppet certs revoked salt keys delete from puppet stored configs (removed from icinga) shut them down ... removed from DNS .. [19:19:07] 3RESTBase, Scrum-of-Scrums, operations: RESTBase production hardware - https://phabricator.wikimedia.org/T76986#1004773 (10GWicke) @RobH, thanks! [19:21:30] 3operations: Put archiva.wikimedia.org behind misc-web-lb and force https - https://phabricator.wikimedia.org/T88139#1004787 (10Ottomata) 3NEW a:3Ottomata [19:21:47] marktraceur: yep! all set :) thanks much! 
[19:22:07] it was just a deploy to mw.org and test wikis, and just a CentralNotice banner issue [19:22:47] * AndyRussG is glad for his colleagues' having urged prudence! [19:24:14] 3operations: decom cp1037,cp1038,cp1039,cp1040 - https://phabricator.wikimedia.org/T87800#1004802 (10Dzahn) can we use the same ticket to finish the workflow nowadays ? (instead of creating linked tickets). now it's not separate queues anymore but just adding the relevant tags. i suggest to just "move" this ove... [19:27:08] 3Phabricator, obsolete, operations: merge tickets in project "ops-core" into project "operations" - https://phabricator.wikimedia.org/T87291#1004819 (10chasemp) [19:27:25] !log phuedx Synchronized php-1.25wmf15/extensions/MobileFrontend/: No-op deployment training (duration: 00m 06s) [19:27:32] Logged the message, Master [19:28:37] 3Ops-Access-Requests, Services: Access to the Cassandra / RESTBase test cluster for Stas, Marko and James - https://phabricator.wikimedia.org/T85492#1004829 (10GWicke) @RobH, @Mark: Any news on this? [19:29:29] 3RESTBase, Ops-Access-Requests, Services: Access to the Cassandra / RESTBase test cluster for Stas, Marko and James - https://phabricator.wikimedia.org/T85492#1004845 (10GWicke) [19:32:13] 3ops-eqiad, operations: decom cp1037,cp1038,cp1039,cp1040 - https://phabricator.wikimedia.org/T87800#1004864 (10Dzahn) [19:32:35] 3Phabricator, operations: merge tickets in project "ops-core" into project "operations" - https://phabricator.wikimedia.org/T87291#1004871 (10chasemp) [19:36:04] 3ops-eqiad, operations: decom cp1037,cp1038,cp1039,cp1040 - https://phabricator.wikimedia.org/T87800#1004885 (10Dzahn) the above servers have been removed from puppet, DNS and all that , see above. please continue with the decom/reclaim workflow with the things local to ops-eqiad, like wipe disk, derack physic...
[19:47:05] (03PS1) 10Legoktm: Update objectcache logging settings for I8a8e278e6f028 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187730 [19:50:42] 3Phabricator, operations: merge tickets in project "ops-core" into project "operations" - https://phabricator.wikimedia.org/T87291#1004937 (10chasemp) P245 This has been completed I believe. [19:51:01] 3Phabricator, operations: merge tickets in project "ops-core" into project "operations" - https://phabricator.wikimedia.org/T87291#1004938 (10chasemp) 5Open>3Resolved [19:54:18] (03PS1) 10Rush: ops-core is no longer an active project [puppet] - 10https://gerrit.wikimedia.org/r/187734 [19:56:35] (03CR) 10Rush: [C: 032] ops-core is no longer an active project [puppet] - 10https://gerrit.wikimedia.org/r/187734 (owner: 10Rush) [19:57:12] chasemp: :) [19:58:14] (03PS1) 10BryanDavis: Show logging config in noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187736 [20:07:02] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: puppet fail [20:28:02] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [20:38:56] 3Continuous-Integration: Puppet is causing changed/added files in 'slave-scripts' git::clone on integration slaves in labs to become root read-only - https://phabricator.wikimedia.org/T87843#1005000 (10Krinkle) [20:52:42] fyi: I'm taking the afternoon afk (dr's appt etc). I'm emailable but if someone pings re an emergency deploy, I'm not liable to respond quickly [20:53:53] 3operations: Cannot use dsh-based restart of parsoid from tin anymore - https://phabricator.wikimedia.org/T87803#1005031 (10GWicke) @akosiaris, trebuchet (salt really) is not able to do rolling restarts reliably, so we are using dsh to actually apply the restart. [20:54:21] greg-g, ok. [20:55:39] 3Ops-Access-Requests: Add James F. 
to pager duty - https://phabricator.wikimedia.org/T88153#1005037 (10Jdforrester-WMF) 3NEW [20:58:11] !log reinstalling cp3020 (seems to have fs corruption issues, but may not be hardware...) [20:58:17] Logged the message, Master [20:58:23] (03CR) 10Chad: [C: 031] fix check_elasticsearch CRITICAL output [puppet] - 10https://gerrit.wikimedia.org/r/186418 (owner: 10Filippo Giunchedi) [20:58:49] (03CR) 10Chad: [C: 032] Show logging config in noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187736 (owner: 10BryanDavis) [21:00:22] PROBLEM - Host cp3020 is DOWN: PING CRITICAL - Packet loss = 100% [21:00:25] (03PS1) 10BBlack: cp3020 -> precise for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/187802 [21:01:00] (03CR) 10BBlack: [C: 032 V: 032] cp3020 -> precise for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/187802 (owner: 10BBlack) [21:01:55] (03CR) 10Aude: [C: 031] Allow "hoo" to sudo into datasets [puppet] - 10https://gerrit.wikimedia.org/r/152724 (https://phabricator.wikimedia.org/T86808) (owner: 10Hoo man) [21:02:18] (03Merged) 10jenkins-bot: Show logging config in noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187736 (owner: 10BryanDavis) [21:02:31] RECOVERY - Host cp3020 is UP: PING OK - Packet loss = 0%, RTA = 96.15 ms [21:02:54] !log demon Synchronized docroot and w: (no message) (duration: 00m 10s) [21:02:58] Logged the message, Master [21:03:25] <^d> Hmmm [21:03:30] <^d> I see logging-labs.php [21:03:32] <^d> Not logging.php [21:04:18] !log demon Synchronized docroot/noc/conf/logging.php.txt: (no message) (duration: 00m 06s) [21:04:22] Logged the message, Master [21:04:33] <^d> There we go [21:04:39] <^d> sync-docroot skips symlinks? 
[21:05:02] PROBLEM - Host cp3020 is DOWN: PING CRITICAL - Packet loss = 100% [21:05:11] <^d> bd808|BUFFER: http://noc.wikimedia.org/conf/highlight.php?file=logging.php [21:07:12] RECOVERY - Host cp3020 is UP: PING OK - Packet loss = 0%, RTA = 88.99 ms [21:07:38] (03CR) 10Chad: [C: 032] beta: Change ProfilerSimpleText to ProfilerXhprof [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186604 (owner: 10BryanDavis) [21:08:36] docroot/noc ? [21:09:15] <^d> Yeah. [21:09:32] PROBLEM - Varnish HTTP bits on cp3020 is CRITICAL: Connection refused [21:09:42] PROBLEM - salt-minion processes on cp3020 is CRITICAL: Connection refused by host [21:09:42] PROBLEM - configured eth on cp3020 is CRITICAL: Connection refused by host [21:09:42] PROBLEM - Disk space on cp3020 is CRITICAL: Connection refused by host [21:09:42] PROBLEM - dhclient process on cp3020 is CRITICAL: Connection refused by host [21:09:42] PROBLEM - RAID on cp3020 is CRITICAL: Connection refused by host [21:09:42] PROBLEM - puppet last run on cp3020 is CRITICAL: Connection refused by host [21:09:42] PROBLEM - DPKG on cp3020 is CRITICAL: Connection refused by host [21:09:51] PROBLEM - HTTPS on cp3020 is CRITICAL: Return code of 255 is out of bounds [21:11:56] (03CR) 10Hashar: "Filled as T88093 by Krenair" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186319 (owner: 10Jdlrobson) [21:12:25] ^d: thanks! [21:16:23] <^d> bd808: yw [21:16:48] oh, wow... [21:17:06] 3 reviewers didn't catch that? [21:18:37] Krenair: yeah. 
sadly easy mistake to make in our config soup [21:23:22] RECOVERY - dhclient process on cp3020 is OK: PROCS OK: 0 processes with command name dhclient [21:23:22] RECOVERY - Disk space on cp3020 is OK: DISK OK [21:23:22] RECOVERY - configured eth on cp3020 is OK: NRPE: Unable to read output [21:23:22] RECOVERY - salt-minion processes on cp3020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:23:22] RECOVERY - DPKG on cp3020 is OK: All packages OK [21:23:23] RECOVERY - RAID on cp3020 is OK: OK: no disks configured for RAID [21:23:54] (03Merged) 10jenkins-bot: beta: Change ProfilerSimpleText to ProfilerXhprof [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186604 (owner: 10BryanDavis) [21:25:22] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: Puppet has 4 failures [21:27:21] RECOVERY - Varnish HTTP bits on cp3020 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.177 second response time [21:29:32] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:33:42] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [21:35:52] PROBLEM - Host cp3020 is DOWN: PING CRITICAL - Packet loss = 100% [21:36:51] RECOVERY - Host cp3020 is UP: PING OK - Packet loss = 0%, RTA = 93.33 ms [21:37:02] RECOVERY - HTTPS on cp3020 is OK: SSLXNN OK - 36 OK [21:38:39] !log killed diamond taking up 100% on labstore1001 [21:38:41] andrewbogott: ^ [21:38:44] Logged the message, Master [21:39:00] YuviPanda|flight: I killed it for a bit yesterday, it didn’t seem to help much. [21:39:07] Although I’m still curious why it was so CPU hungry [21:39:19] YuviPanda|flight: won’t puppet restart it shortly? [21:39:33] andrewbogott: yeah. 
an strace didn’t help much [21:40:06] (03PS3) 10Dzahn: add IPv6 interface to dataset1001 (eth2) [puppet] - 10https://gerrit.wikimedia.org/r/187121 (https://phabricator.wikimedia.org/T68996) [21:40:49] it was more like rpcd was taking up the CPU on labstore [21:43:41] svc: transport ffff8800952b7000 busy, not enqueued [21:43:46] probably related [21:45:28] Brb helping out a lost person [21:52:35] !log re-pooling cp3020 (bits cache esams) - reinstalled, looks sane... [21:52:42] Logged the message, Master [21:54:55] (03PS1) 10Phuedx: Configure JS console recruitment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187815 (https://phabricator.wikimedia.org/T85815) [22:00:51] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [22:02:42] (03CR) 10BryanDavis: [C: 031] Update objectcache logging settings for I8a8e278e6f028 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187730 (owner: 10Legoktm) [22:12:22] 3operations: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1005216 (10Chmarkine) [22:20:31] (03PS1) 10BBlack: cp1047 -> out for hardware T88045 [puppet] - 10https://gerrit.wikimedia.org/r/187821 [22:20:59] (03CR) 10BBlack: [C: 032 V: 032] cp1047 -> out for hardware T88045 [puppet] - 10https://gerrit.wikimedia.org/r/187821 (owner: 10BBlack) [22:22:37] while looking around at pybal cleanup stuff: anyone know the status of mw1018 + mw1118 being disabled for eqiad/apaches + eqiad/api, respectively? [22:22:46] they're local changes for pybal, but currently uncommitted [22:25:51] nevermind, I found those in the SAL, going to commit them with the SAL msgs [22:28:38] (03PS1) 10Jdlrobson: Add wikidata to central auth config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187822 [22:28:40] (03PS1) 10Jdlrobson: Enable JS console recruitment on mobile. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/187823 (https://phabricator.wikimedia.org/T85815) [22:29:04] (03PS2) 10Jdlrobson: Enable JS console recruitment on mobile. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187823 (https://phabricator.wikimedia.org/T85815) [22:29:47] (03Abandoned) 10Jdlrobson: Add wikidata to central auth config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187822 (owner: 10Jdlrobson) [22:30:16] (03CR) 10Jdlrobson: [C: 04-1] "Think this is in wrong place. See https://gerrit.wikimedia.org/r/187823" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187815 (https://phabricator.wikimedia.org/T85815) (owner: 10Phuedx) [22:31:01] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [22:31:01] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [22:33:52] ^ oops that was me [22:34:11] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [22:34:11] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [22:34:48] (03CR) 10Aude: "if it gets unabandoned..." 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187822 (owner: 10Jdlrobson) [22:38:40] !log deployed parsoid version 2abd0eb6 [22:38:45] Logged the message, Master [22:56:33] PROBLEM - puppet last run on virt1000 is CRITICAL: CRITICAL: Puppet has 11 failures [23:02:12] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: puppet fail [23:05:12] 3Wikimedia-Labs-Other, operations: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1005375 (10yuvipanda) [23:14:12] RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [23:22:02] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [23:25:52] 3RESTBase, Ops-Access-Requests, Services: Access to the Cassandra / RESTBase test cluster for Stas, Marko and James - https://phabricator.wikimedia.org/T85492#1005459 (10mobrovac) p:5Triage>3High [23:36:04] 3RESTBase, operations: Public entry point for RESTBase - https://phabricator.wikimedia.org/T78194#1005556 (10mobrovac) p:5Triage>3High [23:36:31] 3Tool-Labs, operations: Replag on labsdb - https://phabricator.wikimedia.org/T88183#1005558 (10yuvipanda) 3NEW [23:39:44] 3Tool-Labs, operations: Replag on labsdb - https://phabricator.wikimedia.org/T88183#1005571 (10Merl) [23:39:47] 3Wikimedia-Labs-Other, operations: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1005570 (10Merl) [23:41:16] 3Tool-Labs, operations: Replag on labsdb - https://phabricator.wikimedia.org/T88183#1005582 (10yuvipanda) Hmm, there's no 1:1 correspondence between the slices that are lagging and the ones that are reporting deadlock errors (s2 in particular). There was a huge spike in network traffic / CPU usage on db1069 sinc... [23:43:45] (03CR) 10Phuedx: "I had no idea mobile.php was a thing." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/187823 (https://phabricator.wikimedia.org/T85815) (owner: 10Jdlrobson) [23:49:58] 3RESTBase, operations, Scrum-of-Scrums, Services: RESTbase deployment - https://phabricator.wikimedia.org/T1228#1005604 (10mobrovac) [23:53:31] 3Tool-Labs, operations: Replag on labsdb - https://phabricator.wikimedia.org/T88183#1005606 (10yuvipanda) replag is still growing, but my battery is running out, the airport people are looking at me suspiciously, and my brain isn't at its best after a 26h flight. Hopefully @springle can take a look soon, if not I...