[00:11:10] PROBLEM - Puppet freshness on knsq9 is CRITICAL: Puppet has not run in the last 10 hours
[00:22:59] maplebed: so....check this out: http://wikitech.wikimedia.org/view/Software_deployments#Week_of_February_20 ...in particular, look at the proximity of the 1.19 commons deployment to the ramp-up of thumb traffic on Swift
[00:23:22] k.
[00:24:16] that feels a little precarious if we actually both stick with our respective schedules
[00:24:52] robla: I want to start moving traffic to swift sooner.
[00:25:33] (AaronSchulz may want to pay attn to this conversation)
[00:25:47] the cluster is being populated now.
[00:26:03] maplebed: what is the state of the hardware orders?
[00:26:05] it'll be ready for a small % of production traffic later this week
[00:26:12] so here's the deal.
[00:26:21] ms5 is currently saturated at 100% IO load
[00:26:30] I converted the pmtpa performance test cluster to be "production".
[00:26:35] because of how swift lets you fail out nodes,
[00:26:47] when the real hardware arrives, I'll fail out a node at a time, replacing it with the real production hardware.
[00:26:57] this lets us use swift immediately (and take load off ms5)
[00:27:15] and still use the production-grade hardware (instead of the old sun hardware currently in the test cluster)
[00:29:01] maplebed: looks like the Software_deployments page probably needs to be updated then
[00:29:28] likely true.
[00:30:03] especially if anything is slated to go into production this week
[00:30:15] woosters: ^^
[00:30:27] ya
[00:31:30] maplebed: can you linky to the relevant docs for failing out nodes?
[00:32:26] AaronSchulz: I'm going to have our playbook here: http://wikitech.wikimedia.org/view/Swift/How_To#Remove_a_failed_storage_node_from_the_cluster
[00:32:52] woosters: can you and maplebed work out who is going to update the Software_deployments page for the new Swift plans, and make sure that at least this week is covered well in advance of actually pushing something?
[00:32:54] but it's basically just failing out all the drives using swift-ring-builder (http://manpages.ubuntu.com/manpages/oneiric/man8/swift-ring-builder.8.html)
[00:33:30] robla: I'd rather leave that to you, since anything getting "pushed" will be Aaron pushing something before I can touch squid to redirect traffic.
[00:33:57] but I agree we should talk through the schedule in more detail, even if it's just event triggers rather than dates
[00:34:08] (does that make sense?)
[00:35:44] done
[00:35:52] reflected on deployment page
[00:36:35] I'm not sure I'd call it done to say that we're going to do it sometime this week
[00:37:08] do you mind if we switch from IRC to IRL for a few minutes?
[00:37:14] k
[00:47:10] New review: tstarling; "(no comment)" [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2142
[01:36:03] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CRIT replication delay 1321 seconds
[01:40:56] New review: tstarling; "Looks good overall, it's certainly much better than the collection of filters that came before. Ther..." [analytics/udp-filters] (master) C: 0; - https://gerrit.wikimedia.org/r/2142
[01:43:33] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:48:09] so here is a fun one https://lists.wikimedia.org/mailman/listinfo/mobile-feedback-l (The current archive is only available to the list members.) .. http://www.all-mail-archive.com/mobile-feedback-l@lists.wikimedia.org/1-1201201071/Re-Mobile-Feedback-A-couple-of-things
[01:48:12] not so private after all
[01:48:18] RobH: thoughts?
[01:49:05] i bet they subscribed an email address and are piping data through .. but mailman logins just hang
[01:49:07] so i can't check
[02:08:23] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 865s
[02:10:31] Ryan_Lane: would you mind looking at http://wikitech.wikimedia.org/view/Swift/Deploy_Plan_-_Thumbnails#squid_changes and telling me if it looks right to you?
[02:10:45] sure
[02:11:05] weren't we thinking of switching to varnish at the same time?
[02:11:11] nope.
[02:11:15] ah ok
[02:11:23] mark brought up the idea, but we decided not to.
[02:11:41] heh
[02:11:46] I gotta admit, I don't know squid very well ;)
[02:11:50] instead we will build out varnish in front of an eqiad cluster and have pmtpa be fronted by swift, eqiad by varnish, and be able to compare them.
[02:12:02] ok, I'll send out a note to ops.
[02:12:10] I've mostly ignored squid since we are moving to varnish
[02:12:20] excluding minor changes here and there
[02:21:21] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1642s
[02:30:01] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:41:21] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
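[Editor's note] The fail-out procedure maplebed describes at 00:32:54 (remove every drive of the failed node with swift-ring-builder, then rebalance) can be sketched as a dry-run shell script. The builder file name, node IP, and device names below are hypothetical placeholders, not the real pmtpa ring layout, and the script only prints each command instead of executing it.

```shell
#!/bin/sh
# Dry-run sketch of failing a storage node out of a Swift ring.
# BUILDER, FAILED_NODE and the device list are hypothetical examples.
BUILDER=object.builder
FAILED_NODE=10.0.0.42

run() { echo "would run: $*"; }   # print instead of executing

# swift-ring-builder's "remove" takes a search value such as <ip>/<device>,
# so each drive on the failed node is removed individually.
for dev in sdb1 sdc1 sdd1; do
    run swift-ring-builder "$BUILDER" remove "$FAILED_NODE/$dev"
done

# Reassign the removed partitions across the remaining nodes; the updated
# ring file would then be pushed out to all storage and proxy hosts.
run swift-ring-builder "$BUILDER" rebalance
```

The actual playbook lives at the Swift/How_To wikitech page linked at 00:32:26.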
[02:43:41] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[02:44:41] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:02:31] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 39s
[03:07:11] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.920 seconds
[03:15:50] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 120 MB (1% inode=60%): /var/lib/ureadahead/debugfs 120 MB (1% inode=60%):
[03:15:50] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 17 MB (0% inode=60%): /var/lib/ureadahead/debugfs 17 MB (0% inode=60%):
[03:17:51] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.744 seconds
[03:27:00] RECOVERY - Disk space on srv223 is OK: DISK OK
[03:49:40] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=60%): /var/lib/ureadahead/debugfs 0 MB (0% inode=60%):
[03:51:30] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:51:50] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:02:50] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.421 seconds
[04:05:13] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.750 seconds
[04:07:23] RECOVERY - Disk space on srv221 is OK: DISK OK
[04:18:33] RECOVERY - Disk space on srv219 is OK: DISK OK
[04:23:23] RECOVERY - MySQL disk space on es1004 is OK: DISK OK
[04:23:33] RECOVERY - Disk space on es1004 is OK: DISK OK
[04:39:33] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No
[04:57:25] New patchset: tstarling; "Disable wmerrors log_backtrace since it is buggy and causes segfaults." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2152
[04:57:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2152
[04:57:57] New review: tstarling; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2152
[05:02:56] New review: tstarling; "(no comment)" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/2152
[05:42:38] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:43:37] Change abandoned: tstarling; "superseded by configuration pushed out with dsh" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2152
[05:48:08] PROBLEM - Full LVS Snapshot on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:56:54] RECOVERY - Full LVS Snapshot on db42 is OK: OK no full LVM snapshot volumes
[05:57:04] PROBLEM - MySQL disk space on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:08:14] RECOVERY - MySQL disk space on db42 is OK: DISK OK
[07:12:33] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:23:18] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours
[08:23:18] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Puppet has not run in the last 10 hours
[09:15:33] * apergos wonders if dataset1001 got its new raid card yet
[09:17:47] hmm perhaps not
[09:17:52] PROBLEM - Puppet freshness on brewster is CRITICAL: Puppet has not run in the last 10 hours
[09:31:22] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:35:22] nope. bummer
[09:46:52] PROBLEM - MySQL Slave Running on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:46:52] PROBLEM - Disk space on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:49:10] PROBLEM - DPKG on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:55:10] PROBLEM - Disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 404454 MB (3% inode=99%):
[09:58:30] PROBLEM - MySQL disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 383892 MB (3% inode=99%):
[10:01:07] RECOVERY - DPKG on db42 is OK: All packages OK
[10:01:07] RECOVERY - Disk space on db42 is OK: DISK OK
[10:01:07] RECOVERY - MySQL Slave Running on db42 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[10:22:00] PROBLEM - Puppet freshness on knsq9 is CRITICAL: Puppet has not run in the last 10 hours
[10:23:24] New review: Dzahn; "https://developer.mozilla.org/en/Mobile/Viewport_meta_tag" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2110
[10:23:24] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2110
[10:31:12] New review: Dzahn; "looks good and already got a +2, just not verified" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2109
[10:31:13] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2109
[11:08:43] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:11:23] RECOVERY - MySQL slave status on es1004 is OK: OK:
[11:11:53] PROBLEM - MySQL Idle Transactions on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:12:43] PROBLEM - MySQL Slave Running on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:12:43] PROBLEM - Disk space on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:13:23] PROBLEM - mysqld processes on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:22:43] PROBLEM - RAID on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:25:43] PROBLEM - Full LVS Snapshot on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:25:53] PROBLEM - MySQL disk space on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:27:23] PROBLEM - DPKG on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:33:23] PROBLEM - MySQL Recent Restart on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:50:08] New patchset: Tim Starling; "Disable wmerrors since it is buggy and causes segfaults." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2154
[11:50:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2154
[11:52:20] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2154
[11:52:26] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2154
[11:52:27] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2154
[12:14:59] PROBLEM - Puppet freshness on mw8 is CRITICAL: Puppet has not run in the last 10 hours
[12:57:20] RECOVERY - mysqld processes on db42 is OK: PROCS OK: 1 process with command name mysqld
[12:57:20] RECOVERY - DPKG on db42 is OK: All packages OK
[12:57:50] RECOVERY - MySQL Slave Running on db42 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[12:59:30] RECOVERY - MySQL disk space on db42 is OK: DISK OK
[12:59:30] RECOVERY - Full LVS Snapshot on db42 is OK: OK no full LVM snapshot volumes
[13:01:20] RECOVERY - MySQL Idle Transactions on db42 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[13:03:10] RECOVERY - RAID on db42 is OK: OK: State is Optimal, checked 2 logical device(s)
[13:09:00] RECOVERY - Disk space on db42 is OK: DISK OK
[13:14:20] RECOVERY - MySQL Recent Restart on db42 is OK: OK 3416479 seconds since restart
[13:45:50] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:50:10] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:30:01] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:35:11] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:58:11] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:20:58] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 4.968 seconds
[15:21:08] RECOVERY - Puppet freshness on brewster is OK: puppet ran at Tue Jan 31 15:20:53 UTC 2012
[15:59:38] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.373 seconds
[16:03:18] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:03:48] PROBLEM - Host ekrem is DOWN: CRITICAL - Host Unreachable (208.80.152.178)
[16:26:28] RECOVERY - Host ekrem is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms
[16:36:48] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:56:08] PROBLEM - MySQL Idle Transactions on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:03:08] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:14:18] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.954 seconds
[17:29:58] RECOVERY - MySQL Idle Transactions on db42 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[17:30:18] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:48:18] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:55:38] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:33:38] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Puppet has not run in the last 10 hours
[18:33:38] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours
[18:52:18] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:52:30] RobH: around?
[18:53:58] Jeff_Green: am now, was out snagging foodz
[18:54:12] hey, so how's our demo situation with Dell?
[18:54:45] we really never demo anything, mostly cuz we are sure before we order ;) (well, mostly sure, and quite lucky)
[18:54:47] on CT's request I've been talking to the sales guy I worked with for CL at FusionIO and it sounds like their loaner program has significantly wound down
[18:54:52] lets discuss in private
[18:54:55] k
[18:55:28] !log restarted squid and lighttpd on brewster
[18:55:30] Logged the message, Master
[18:55:33] that box is starting to piss me off
[18:56:18] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.841 seconds
[19:08:22] Morning binasher
[19:08:45] hello
[19:08:46] Greg updated that .sql file in SVN, the schema change should be good to go now
[19:09:06] I won't be around much longer, I'm gonna get some sleep after this IRC meeting I'm in
[19:13:58] RoanKattouw: and just to verify, this is only applicable where wmgUseArticleFeedbackv5 == true?
[19:14:43] Yes
[19:14:49] So it's only enwiki, testwiki and en_labswikimedia IIRC
[19:14:52] is there any kind of versioning attached to the feature release tomorrow that i can reference in the updatelog table?
[19:14:58] No
[19:15:11] You can just make the changes and the old code will continue to work
[19:15:15] ok
[19:15:29] will email when done
[19:15:47] testwiki and en_labswikimedia probably won't need any special treatment, they're small
[19:16:00] New patchset: Pyoungmeister; "adding support for other paging groups into our nagios. testing with the udp2log alerts. maybe also breaking nagios ;)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2155
[19:16:41] can i set wmgUseArticleFeedbackv5 to false for enwiki to disable the feature temporarily while applying to the master with log_bin disabled?
[19:17:15] Sure
[19:17:34] I guess that's an easy way to do it without a master switch
[19:26:43] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2155
[19:26:44] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2155
[19:30:18] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:39:36] New patchset: Diederik; "Initial commit, feedback Catrope incorporated, feedback Tim Starling incorporated" [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2142
[19:46:28] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[19:48:38] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:57:28] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.442 seconds
[19:57:52] !log restarting puppetmaster proc as it's serving up 500s to all clients (well, 3 randomly selected ones...)
[19:57:53] Logged the message, and now dispaching a T1000 to your position to terminate you.
[19:58:14] !log on stafford, that is
[19:58:15] Logged the message, and now dispaching a T1000 to your position to terminate you.
[20:25:49] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.274 second response time
[20:32:39] PROBLEM - Puppet freshness on knsq9 is CRITICAL: Puppet has not run in the last 10 hours
[20:34:36] hey ops folks, to whom do I direct a request to get an entry for donate.wikimedia.org in the interwiki lists?
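[Editor's note] binasher's plan in the 19:16-19:17 exchange above (flip wmgUseArticleFeedbackv5 off for enwiki, apply the DDL on the master with binary logging disabled so it never replicates, then flip the flag back) might look roughly like the following dry-run sketch. The file names, the master host, and the use of sync-file and SET sql_log_bin=0 are my assumptions, not the actual procedure; the real DDL lives in the .sql file Greg updated in SVN. The script only prints each step.

```shell
#!/bin/sh
# Dry-run sketch only: echoes each step instead of executing it.
run() { echo "would run: $*"; }

# 1. Disable the feature for enwiki (wmgUseArticleFeedbackv5 => false)
#    and deploy the config change. File name and tooling are assumptions.
run sync-file wmf-config/InitialiseSettings.php "temporarily disable AFTv5 on enwiki"

# 2. Apply the schema change on the master with session binary logging off,
#    so the DDL is never written to the binlog and does not replicate.
run mysql -h db-master enwiki -e "SET SESSION sql_log_bin=0; SOURCE aftv5.sql;"

# 3. Re-enable the feature once the schema change is in place.
run sync-file wmf-config/InitialiseSettings.php "re-enable AFTv5 on enwiki"
```

Disabling binlogging avoids a master switch, at the cost of having to apply the same DDL to each slave separately.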
[20:35:00] pgehres: http://meta.wikimedia.org/wiki/Interwiki_map
[20:36:49] RoanKattouw: thanks, it says we shouldn't add Foundation projects there, is this an exception to that rule?
[20:36:58] Is donate.wm.o a wiki?
[20:37:05] it is :-)
[20:37:18] its on the cluster
[20:37:21] Right
[20:37:26] Maybe it should be in the interwiki map script instead
[20:39:07] where does that map live?
[20:39:27] In a PHP file somewhere, lemme see
[20:40:03] hmm
[20:40:12] WMF wikis can be on that map just fine
[20:40:20] Just add it there, then file a bug to have the interwiki cache updated
[20:40:40] okay, thanks very much
[20:55:09] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.463 seconds
[20:57:56] !log temp stopping puppet on emery for testing
[20:57:58] Logged the message, and now dispaching a T1000 to your position to terminate you.
[21:14:49] PROBLEM - udp2log log age on emery is CRITICAL: CRITICAL: log files /fake.log, have not been written to in 6 hours
[21:14:59] PROBLEM - udp2log processes on emery is CRITICAL: CRITICAL: filters absent: /var/local/fakefilter,
[21:23:37] !log restarting puppet on emery
[21:23:38] Logged the message, and now dispaching a T1000 to your position to terminate you.
[21:38:55] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:38:59] New patchset: Pyoungmeister; "adding nimish and erikz to alert group for udp2log checks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2158
[21:39:28] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2158
[21:39:28] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2158
[21:47:45] RECOVERY - udp2log log age on emery is OK: OK: all log files active
[21:48:36] diederik: one more question about the udp2log checks
[21:48:36] do you also want me to check the udp2log-aft process and it's filters/logs?
[21:48:46] shoot
[21:49:00] yes!
[21:49:15] ok. cool. I'll put that into place, and then we shall call it done?
[21:49:20] and for those, please add dario, fabrice and me as contacts
[21:49:33] i think so!
[21:49:54] oh, uh, I would greatly prefer to not segment alert groups that way. I can if it's really important, though
[21:50:21] nimish and erikz don't care about aft logs :(
[21:50:25] RECOVERY - udp2log processes on emery is OK: OK: all filters present
[21:50:32] dario and fabrice really care about that filter :)
[21:50:43] do they have shells on emery/locke?
[21:51:00] basically, I would love it if all of the people with shells got alerts for either one
[21:51:11] so that if things just need to be restarted or some such
[21:51:17] more people will be able to do so
[21:52:08] i don't think so
[21:52:17] maybe dario has shell
[21:52:41] basically, I'm mildly against emails going out to people who can't fix the problem
[21:52:45] as ops will already be aware
[21:54:07] and anyone will be able to see what's going on via the web interface
[22:12:35] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.384 seconds
[22:26:51] New review: Tim Starling; "(no comment)" [analytics/udp-filters] (master) C: 0; - https://gerrit.wikimedia.org/r/2142
[22:30:23] PROBLEM - check_minfraud2 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:30:23] PROBLEM - check_minfraud2 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:30:23] PROBLEM - check_minfraud3 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:30:23] PROBLEM - check_minfraud3 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:32:33] PROBLEM - Puppet freshness on mw8 is CRITICAL: Puppet has not run in the last 10 hours
[22:35:23] RECOVERY - check_minfraud2 on payments1 is OK: HTTP OK: HTTP/1.1 200 OK - 8643 bytes in 0.253 second response time
[22:35:23] RECOVERY - check_minfraud3 on payments3 is OK: HTTP OK: HTTP/1.1 200 OK - 8644 bytes in 0.225 second response time
[22:35:23] RECOVERY - check_minfraud2 on payments4 is OK: HTTP OK: HTTP/1.1 200 OK - 8643 bytes in 0.219 second response time
[22:35:23] RECOVERY - check_minfraud3 on payments2 is OK: HTTP OK: HTTP/1.1 200 OK - 8644 bytes in 0.269 second response time
[22:39:13] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:01:23] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.728 seconds
[23:09:26] New patchset: Diederik; "Initial commit, feedback Catrope incorporated, feedback Tim Starling (2x) incorporated" [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2142
[23:22:15] notpeter: i understand your pov about being against emails going out to people who can't fix the problem but the reason that we need these emails is that too often we miss the current warnings
[23:59:43] New review: Tim Starling; "(no comment)" [analytics/udp-filters] (master) C: 0; - https://gerrit.wikimedia.org/r/2142
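[Editor's note] The alert-group split that notpeter and diederik debate above (separate contacts for the plain udp2log checks and the udp2log-aft checks) would, in stock Nagios syntax, come down to contactgroup definitions along these lines. The group and contact names here are illustrative guesses, not the actual puppet-managed configuration behind r2155/r2158.

```
# Illustrative Nagios config fragment only; the real definitions live in puppet.
define contactgroup {
    contactgroup_name   udp2log-alerts
    alias               udp2log log/process alerts
    members             nimish, erikz
}

define contactgroup {
    contactgroup_name   udp2log-aft-alerts
    alias               ArticleFeedback (aft) udp2log alerts
    members             diederik, dario, fabrice
}

# Each service definition then names the group that should be notified:
#   contact_groups      udp2log-aft-alerts
```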