[00:28:32] (PS1) Eevans: WIP: certificate/keystore generation script [puppet] - https://gerrit.wikimedia.org/r/236389 (https://phabricator.wikimedia.org/T108953)
[00:36:40] operations, RESTBase, RESTBase-Cassandra, Patch-For-Review: Cassandra inter-node encryption (TLS) - https://phabricator.wikimedia.org/T108953#1612099 (Eevans) The [[https://gerrit.wikimedia.org/r/236389|attached Gerrit]] is for a script to generate a root CA and signed keystores, based on the conte...
[00:42:50] PROBLEM - puppet last run on mw1175 is CRITICAL: CRITICAL: Puppet has 1 failures
[01:07:48] RECOVERY - puppet last run on mw1175 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:33:47] (PS1) Eevans: updated gc settings [puppet] - https://gerrit.wikimedia.org/r/236391 (https://phabricator.wikimedia.org/T106619)
[02:19:54] !log l10nupdate@tin Synchronized php-1.26wmf21/cache/l10n: l10nupdate for 1.26wmf21 (duration: 06m 14s)
[02:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:23:09] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf21) at 2015-09-06 02:23:08+00:00
[02:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:35:46] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:36:06] PROBLEM - puppet last run on elastic1014 is CRITICAL: CRITICAL: puppet fail
[03:36:56] PROBLEM - puppet last run on mw1257 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:36:56] PROBLEM - puppet last run on ytterbium is CRITICAL: CRITICAL: Puppet has 1 failures
[03:36:56] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 1 failures
[03:51:56] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100%
[03:53:05] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 2.28 ms
[04:00:56] RECOVERY - puppet last run on mw1257 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[04:00:56] RECOVERY - puppet last run on ytterbium is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[04:01:37] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:02:05] RECOVERY - puppet last run on elastic1014 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[04:02:56] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:27:57] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun Sep 6 04:27:57 UTC 2015 (duration 27m 56s)
[04:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:31:24] PROBLEM - puppet last run on mw1008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:44] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:54] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:04] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:15] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:24] PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:34] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:55:24] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:55:45] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:57:04] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:57:24] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:57:24] RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:34] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:59:04] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:43:42] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100%
[08:44:51] RECOVERY - Host mw2027 is UP: PING OK - Packet loss = 0%, RTA = 35.14 ms
[14:18:01] (PS1) MarcoAurelio: Enable Extension:EducationProgram on enwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/236418 (https://phabricator.wikimedia.org/T111630)
[14:22:42] (PS12) Merlijn van Deen: toollabs: add script to generate python package listings [puppet] - https://gerrit.wikimedia.org/r/228635 (https://phabricator.wikimedia.org/T101646)
[14:22:47] (PS1) Merlijn van Deen: toollabs: add python-pyicu [puppet] - https://gerrit.wikimedia.org/r/236419 (https://phabricator.wikimedia.org/T102165)
[14:22:47] (PS1) Merlijn van Deen: toollabs: add python-enum34 [puppet] - https://gerrit.wikimedia.org/r/236420 (https://phabricator.wikimedia.org/T111602)
[14:22:49] (PS1) Merlijn van Deen: toollabs: add python-pil [puppet] - https://gerrit.wikimedia.org/r/236421 (https://bugzilla.wikimedia.org/108210)
[15:15:22] operations: ircecho should support nickserv registration - https://phabricator.wikimedia.org/T48254#1612565 (Krenair)
[15:15:23] operations, Icinga: register a nickserv account for icinga-wm - https://phabricator.wikimedia.org/T22771#1612564 (Krenair)
[16:37:31] operations, Performance-Team, Graphite, Patch-For-Review: "sum" aggregation broken in Graphite - https://phabricator.wikimedia.org/T111170#1612609 (Krinkle) Seems to work as expected. Last 7 days (unaggregated): https://graphite.wikimedia.org/render/?from=-7days&until=-1h&target=legendValue(mw.js.de...
[16:48:14] (CR) Krinkle: "Per TTO, and others on the discussion at https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)/Archive_136#Edit_Tags , it seems" [mediawiki-config] - https://gerrit.wikimedia.org/r/218353 (https://phabricator.wikimedia.org/T97013) (owner: Cenarium)
[16:50:28] Krenair: I'm trying to figure out how to fix https://phabricator.wikimedia.org/T111570 but I can't see the wiki on any dblist I know of.
[16:50:53] checked all.dblist, special, fishbowl, nonglobal... nothing
[16:51:21] mafk, that's why it's marked as Wikimedia-DC rather than Wikimedia-Site-Requests :)
[16:51:26] It's on the WMDC server, not WMF servers
[16:51:45] aaahh
[16:51:48] right
[16:52:06] ktnx
[16:52:16] If you open the domain mentioned in your browser, it doesn't go to WMF and a big clue is that it's not HTTPS-only
[16:52:49] CC'd on that task are some WMDC roots, so I'm sure it'll get done soon
[17:25:23] (CR) Cenarium: "@Krinkle: Do you mean doing Change-Id: I27946be614e740add064baee3029c2ae2754ee19 (restrict by default to bot and sysop) ?" [mediawiki-config] - https://gerrit.wikimedia.org/r/218353 (https://phabricator.wikimedia.org/T97013) (owner: Cenarium)
[18:03:10] PROBLEM - very high load average likely xfs on ms-be1003 is CRITICAL: CRITICAL - load average: 241.59, 153.93, 74.42
[18:42:08] operations: rename gerrit2 account in LDAP - https://phabricator.wikimedia.org/T80648#1612683 (Krenair) @demon?
[18:46:17] operations, Labs, wikitech.wikimedia.org: Rename specific account in LDAP, Wikitech, Gerrit and Phabricator - https://phabricator.wikimedia.org/T85913#1612685 (Krenair) >>! In T85913#1363840, @demon wrote: >>>! In T85913#1362894, @Krenair wrote: >>>>! In T85913#1018041, @demon wrote: >>> We have done i...
[19:27:11] PROBLEM - puppet last run on ms-be1003 is CRITICAL: CRITICAL: puppet fail
[19:29:48] Going on for a couple hours now https://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=cpu_report&c=Swift+eqiad&h=ms-be1003.eqiad.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=NOGROUPS
[19:33:22] ms-be... that'd be media storage backend I think?
[19:33:32] for swift file storage
[19:33:46] Krenair: yeah it is
[19:34:28] well, nothing in fluorine:/a/mw-log/SwiftBackend.log about it
[19:35:41] high CPU and network's dropped off according to Ganglia, likely needs a reboot
[19:35:45] Most other swift servers reduced CPU usage at the same time as ms-be1003 started running in circles.
[19:36:16] hhvm.log is getting filled with something file-related
[19:37:18] looks like it can't unserialise some exif data. don't know if it's relevant
[19:39:19] or at least, something including exif data
[19:40:03] quite a few ms-be1xxx's have high load
[19:42:16] those warnings in hhvm.log have gone now
[19:53:53] RECOVERY - puppet last run on ms-be1003 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[19:54:49] Now go figure whether puppet caused the overload or was delayed by an overload.
[19:55:17] ms-be1003 has a nice CPU graph. perfect shade of red and pink with a small dot of blue :)
[19:57:06] Nemo_bis: previous run failed so this is likely a successful run. still massive CPU and load on the server. network is still incredibly low compared to its average too
[19:57:35] It increased to a whopping 100 KB/s for a moment!
[20:14:30] Ops-Access-Requests, operations: Requesting access to hadoop / hive (analytics-privatedata-users) for Addshore - https://phabricator.wikimedia.org/T111204#1612726 (Addshore) @RobH I have already done the L3 thing! Also yes, I have already signed an NDA, I am in the ldap group and also the phab group. @De...
[20:17:14] PROBLEM - puppet last run on db2009 is CRITICAL: CRITICAL: puppet fail
[20:18:55] (PS1) Alex Monk: Remove obsolete comment about apache-config [puppet] - https://gerrit.wikimedia.org/r/236485
[20:24:55] (PS2) ArielGlenn: dumps: be able to specify number of chunks for abstracts [dumps] (ariel) - https://gerrit.wikimedia.org/r/235704
[20:26:10] (CR) ArielGlenn: [C: 2 V: 2] dumps: be able to specify number of chunks for abstracts [dumps] (ariel) - https://gerrit.wikimedia.org/r/235704 (owner: ArielGlenn)
[20:38:26] (PS1) ArielGlenn: dumps: big wikis do abstracts in 4 chunks, no need for page ranges [puppet] - https://gerrit.wikimedia.org/r/236487
[20:39:27] (CR) ArielGlenn: [C: 2] dumps: big wikis do abstracts in 4 chunks, no need for page ranges [puppet] - https://gerrit.wikimedia.org/r/236487 (owner: ArielGlenn)
[20:41:14] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[20:45:15] RECOVERY - puppet last run on db2009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:50:02] operations, RESTBase-Cassandra: cassandra - enable Inter-node encryption - https://phabricator.wikimedia.org/T94132#1612761 (faidon)
[20:50:04] operations, RESTBase, RESTBase-Cassandra, Patch-For-Review: Cassandra inter-node encryption (TLS) - https://phabricator.wikimedia.org/T108953#1612762 (faidon)
[20:50:28] operations, RESTBase, RESTBase-Cassandra, Patch-For-Review: Cassandra inter-node encryption (TLS) - https://phabricator.wikimedia.org/T108953#1535288 (faidon)
[20:50:29] operations, RESTBase, RESTBase-Cassandra: secure Cassandra/RESTBase cluster - https://phabricator.wikimedia.org/T94329#1612764 (faidon)
[20:50:32] operations, RESTBase-Cassandra: cassandra - enable Inter-node encryption - https://phabricator.wikimedia.org/T94132#1155942 (faidon)
[20:53:35] operations, Analytics-Cluster, Traffic: Secure inter-datacenter web request log (Kafka) traffic - https://phabricator.wikimedia.org/T92602#1612768 (faidon) Kafka upstream is getting TLS/SSL support pretty soon. Last rumour I heard is 0.8.3 (due in October) for Kafka proper. librdkafka already [[ https...
[20:58:14] operations: Encrypt all the things - https://phabricator.wikimedia.org/T111653#1612769 (faidon) NEW
[20:59:28] operations, RESTBase, RESTBase-Cassandra, Patch-For-Review: Cassandra inter-node encryption (TLS) - https://phabricator.wikimedia.org/T108953#1612777 (faidon)
[20:59:30] operations, Traffic, HTTPS, Patch-For-Review: HTTPS for internal service traffic - https://phabricator.wikimedia.org/T108580#1612778 (faidon)
[20:59:31] operations: Encrypt all the things - https://phabricator.wikimedia.org/T111653#1612776 (faidon)
[21:02:42] operations, Database: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#1612781 (faidon) NEW
[21:26:07] operations: Encrypt all the things - https://phabricator.wikimedia.org/T111653#1612806 (Platonides)
[21:26:46] Platonides: I was about to add those too, thanks :)
[21:27:18] :)
[21:27:38] I like that task you made, paravoid
[21:28:09] :)
[21:28:27] I foresee lots of blocking tasks, though
[21:36:53] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [500.0]
[21:41:04] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 9 below the confidence bounds
[21:45:04] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds
[21:49:58] paravoid: are you still here?
[21:57:04] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [5000000.0]
[21:58:44] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[21:59:04] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[22:28:08] operations: ms-be1003 has irregular metrics on Ganglia - https://phabricator.wikimedia.org/T111658#1612838 (JohnLewis) NEW
[22:29:07] Ops-Access-Requests, operations: Requesting access to hadoop / hive (analytics-privatedata-users) for Addshore - https://phabricator.wikimedia.org/T111204#1612845 (Deskana) >>! In T111204#1612726, @Addshore wrote: > This is for analysis of api usage of wikidata. Approved, on that basis.
[22:47:03] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[23:17:50] ori, so we have both ircecho and tcpircbot and they both use SingleServerIRCBot subclasses
[23:18:09] Or, and two copies of udpmxircecho.py.erb which also has one
[23:18:16] Oh*
[23:19:37] ircecho doesn't support various things - custom server ports, SSL, and NickServ login (T48254)
[23:20:20] operations: ircecho should support nickserv registration - https://phabricator.wikimedia.org/T48254#1612939 (Krenair)
[23:21:11] but tcpircbot does
[23:28:21] it seems silly to implement the same thing 2 or 3 times
[23:29:14] PROBLEM - IPsec on cp1060 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp2003_v6
[23:29:29] Krenair: with you so far
[23:29:47] ircecho was buggy
[23:29:50] and crashed all the time
[23:30:05] It's still in use :/
[23:30:14] Maybe it can be replaced with a simple script to send to a tcpircbot
[23:30:42] is ircecho what we use to echo recent changes to irc?
[23:31:08] mw-rc-irc, yes
[23:31:24] RECOVERY - IPsec on cp1060 is OK: Strongswan OK - 24 ESP OK
[23:31:59] I can add a command-line flag to tcpircbot that makes it read from stdin rather than a tcp socket
[23:32:07] it relies on one of the copies of udpmxircecho.py.erb
[23:32:53] that sounds like a better idea
[23:33:07] although then it's not just tcpircbot, but okay
[23:33:17] SyntaxHighlight_TCP
[23:33:48] were you interested in doing this?
[23:33:53] if so I can offer CR, etc.
[23:34:40] I'll have a go
[23:35:01] Thanks, I'll add you to the reviewer list when it gets to gerrit
[23:42:42] ori, unrelated thing - for https://gerrit.wikimedia.org/r/#/c/232675/3 it just occurred to me that I should just be able to include MWWikiversions.php and call those functions?
[23:43:03] rather than reimplement then in the puppet repo
[23:44:21] them*
[23:46:01] it's all in the scap module
[23:49:48] yeah
[23:49:50] that's true
[23:54:52] mwscriptwikiset also needs to be merged with foreachwikiindblist somehow
[23:55:09] but that's another commit
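
Editor's note on the [23:31:59] suggestion (a flag that makes the bot read from stdin rather than a TCP socket): the rough shape of that idea might look like the sketch below. This is not tcpircbot's actual code. The flag names `--stdin` and `--listen-port`, the default port, and the `say` callback are all hypothetical, and the IRC side is left out entirely; the sketch only shows how one input loop can serve both sources.

```python
import argparse
import socket
import sys


def parse_args(argv):
    """Parse the hypothetical flags; neither name comes from tcpircbot."""
    parser = argparse.ArgumentParser(description="relay input lines to a sink")
    parser.add_argument("--stdin", action="store_true",
                        help="read lines from stdin instead of a TCP socket")
    parser.add_argument("--listen-port", type=int, default=9200,
                        help="TCP port to listen on when --stdin is not given")
    return parser.parse_args(argv)


def iter_messages(stream):
    """Yield each non-empty line of `stream` without its line ending."""
    for line in stream:
        line = line.rstrip("\r\n")
        if line:
            yield line


def open_source(args):
    """Return a line-iterable text stream: stdin, or one accepted TCP client."""
    if args.stdin:
        return sys.stdin
    # Assumption: a single local client is enough for the sketch.
    server = socket.create_server(("127.0.0.1", args.listen_port))
    conn, _addr = server.accept()
    return conn.makefile("r", encoding="utf-8", errors="replace")


def relay(stream, say):
    """Pass every message from `stream` to `say`; return how many were sent."""
    count = 0
    for message in iter_messages(stream):
        say(message)
        count += 1
    return count


def main(argv=None):
    """Entry point: `say` is print here; a real bot would post to a channel."""
    args = parse_args(sys.argv[1:] if argv is None else argv)
    return relay(open_source(args), print)
```

Because `relay` only needs something it can iterate line by line, the same loop serves a pipe (for example a producer piped into the script with `--stdin`) and a socket, which is the point made at [23:28:21] about not implementing the same thing two or three times.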