[07:40:52] (03PS1) 10BBlack: enable OCSP for misc cluster certs T97506 [puppet] - 10https://gerrit.wikimedia.org/r/212257 [07:42:18] (03CR) 10BBlack: [C: 032] enable OCSP for misc cluster certs T97506 [puppet] - 10https://gerrit.wikimedia.org/r/212257 (owner: 10BBlack) [07:48:51] is someone on a mgmt connection right now? [07:49:04] 6operations, 10Traffic, 7HTTPS, 5Patch-For-Review: Use OCSP Stapling on misc cluster - https://phabricator.wikimedia.org/T97506#1298522 (10BBlack) 5Open>3Resolved a:3BBlack Planet cert worked fine (their update lifetime is much longer than GS's), so I went and head and turned this on. Seems to be wo... [07:49:16] (it is probably sean) [07:51:00] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "This is very nice and will surely help in the future, I have a few nitpicks." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/212187 (owner: 10Ori.livneh) [07:51:58] <_joe_> jynus: need help? [07:52:11] <_joe_> if so, just ping me [07:52:29] _joe_, not really, just a busy mgmt connection [07:52:56] 6operations, 10Traffic, 7discovery-system, 5services-tooling: Package a modern version of etcd for jessie, trusty - https://phabricator.wikimedia.org/T97970#1298536 (10Joe) [07:53:11] I can reboot it without it [07:53:48] 6operations, 10Traffic, 7discovery-system, 5services-tooling: Create a tool to sync static configuration from a repository to the consistent k/v store - https://phabricator.wikimedia.org/T97978#1298537 (10Joe) [07:53:55] <_joe_> jynus: you can reset the console [07:54:11] <_joe_> jynus: do it, I wouldn't work without a functioning mgmt connection [08:00:38] (03CR) 10Physikerwelt: "@GWicke. Good. I can test the rendering, but not the configuration i.e. 
how does service::node tell the different localtion of the node_mo" [puppet] - 10https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: 10Ori.livneh) [08:15:24] 6operations, 10Traffic, 7discovery-system, 5services-tooling: integrate (pybal|varnish)->varnish backend config/state with etcd or similar - https://phabricator.wikimedia.org/T97029#1298559 (10Joe) [08:16:08] 6operations, 10Traffic, 7discovery-system, 5services-tooling: Figure out an etcd deploy strategy that includes multi DC failure scenarios. - https://phabricator.wikimedia.org/T98165#1298560 (10Joe) [08:17:10] 6operations, 10Deployment-Systems, 10Traffic: Varnish cache busting desired for /static/$VERSION/ resources which change within the lifetime of a WMF release branch - https://phabricator.wikimedia.org/T99096#1298563 (10BBlack) p:5Triage>3Normal [08:17:28] 6operations, 10Beta-Cluster, 10Deployment-Systems, 10Traffic: Upgrade beta-cluster caches to jessie - https://phabricator.wikimedia.org/T98758#1298566 (10BBlack) p:5Triage>3Normal [08:17:52] 6operations, 10Traffic: Switch Varnish's GeoIP code to libmaxminddb/GeoIP2 - https://phabricator.wikimedia.org/T99226#1298569 (10BBlack) p:5Triage>3Normal [08:19:44] 6operations, 10Traffic, 7discovery-system, 5services-tooling: Integrate confd into the varnish configuration to generate the list of active backends - https://phabricator.wikimedia.org/T97975#1298571 (10Joe) [08:20:00] 7Puppet, 6operations, 10Traffic, 7discovery-system, 5services-tooling: Create a confd puppet module - https://phabricator.wikimedia.org/T97974#1298572 (10Joe) [08:20:43] 6operations, 10Traffic, 5Patch-For-Review, 7discovery-system, 5services-tooling: Create an etcd puppet module + find suitable servers for deployment - https://phabricator.wikimedia.org/T97973#1298573 (10Joe) [08:23:02] 6operations, 10Traffic, 7discovery-system: Figure out a security model for etcd - https://phabricator.wikimedia.org/T97972#1298574 (10Joe) As ACLs are coming in etcd 
2.1, for now I'd just use SSL connections. The other option we have is to put nginx in front of etcd. [08:23:15] 6operations, 10Traffic, 7discovery-system, 5services-tooling: Figure out a security model for etcd - https://phabricator.wikimedia.org/T97972#1298575 (10Joe) [08:28:27] 6operations, 7discovery-system, 5services-tooling: Create a debian package of python-etcd - https://phabricator.wikimedia.org/T99771#1298576 (10Joe) 3NEW [08:30:27] PROBLEM - puppet last run on wtp2015 is CRITICAL puppet fail [08:31:54] (03PS3) 10Giuseppe Lavagetto: Disable RESTBase restarts by Puppet on config change [puppet] - 10https://gerrit.wikimedia.org/r/212016 (owner: 10Mobrovac) [08:32:02] <_joe_> mobrovac: can I merge ^^ [08:32:09] <_joe_> I was about to suggest this yesterday [08:32:18] damn, was in the process of changing that [08:32:22] <_joe_> puppet is _not_ a coordination framework [08:32:25] <_joe_> oh sorry [08:32:31] <_joe_> what do you want to change? [08:32:48] _joe_: i'd like to add some requires to be sure the first install goes smoothly [08:32:50] makes sense? 
[08:32:57] (for news clusters) [08:33:18] <_joe_> mobrovac: require instead of notify, yeah makes sense [08:33:20] <_joe_> go on [08:33:35] ah you just rebased the change [08:33:45] cool, that means i can continue what i was doing :) [08:36:05] <_joe_> yes [08:36:59] (03PS4) 10Mobrovac: Disable RESTBase restarts by Puppet on config change [puppet] - 10https://gerrit.wikimedia.org/r/212016 [08:37:04] _joe_: there ^^ [08:37:37] (I also added this to https://wikitech.wikimedia.org/wiki/Incident_documentation/20150519-RESTBase#Actionables ) [08:38:13] (03CR) 10Filippo Giunchedi: [C: 04-1] Slight refactor for varnishstatus.py (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [08:38:24] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "remove a redundant require and it LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/212016 (owner: 10Mobrovac) [08:39:19] _joe_: oh puppet is smarter than i gave it credit for :D [08:39:39] <_joe_> mobrovac: it's actually dumber than you imagine, in general [08:39:59] hence my smile at the end of the statement [08:40:11] 6operations, 7discovery-system, 5services-tooling: Create a debian package of python-etcd - https://phabricator.wikimedia.org/T99771#1298585 (10fgiunchedi) a:3fgiunchedi I'll take a stab at this, do we expect precise support from the beginning? [08:40:13] (03PS5) 10Mobrovac: Disable RESTBase restarts by Puppet on config change [puppet] - 10https://gerrit.wikimedia.org/r/212016 [08:40:36] (03CR) 10Mobrovac: Disable RESTBase restarts by Puppet on config change (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/212016 (owner: 10Mobrovac) [08:40:54] 6operations, 7discovery-system, 5services-tooling: Create a debian package of python-etcd - https://phabricator.wikimedia.org/T99771#1298587 (10Joe) that would be great since the puppetmasters have precise, but we can live without it I guess. 
[08:46:07] RECOVERY - puppet last run on wtp2015 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures [08:51:40] (03PS1) 10Matanya: mailman: restart mailman after config changes [puppet] - 10https://gerrit.wikimedia.org/r/212260 [08:52:54] (03CR) 10Giuseppe Lavagetto: [C: 032] Disable RESTBase restarts by Puppet on config change [puppet] - 10https://gerrit.wikimedia.org/r/212016 (owner: 10Mobrovac) [09:09:27] RECOVERY - Disk space on graphite2001 is OK: DISK OK [09:30:16] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 604 [09:34:00] <_joe_> mmh grrrit-wm not working [09:34:19] <_joe_> Coren: are tools ok? I kind of remember it working from toollabs [09:35:00] Hm. There's nothing wrong systemwide that I can see. Lemme see if there's something obviously wrong with grrit itself. [09:35:16] RECOVERY - check_mysql on db1008 is OK: Uptime: 2926176 Threads: 1 Questions: 9288246 Slow queries: 19429 Opens: 48156 Flush tables: 2 Open tables: 64 Queries per second avg: 3.174 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [09:35:18] !log jynus Synchronized wmf-config/db-eqiad.php: repool db1063 (duration: 00m 12s) [09:35:28] Logged the message, Master [09:37:32] <_joe_> godog, akosiaris I'm merging the etcd module, we can make it better later on if we think we need it [09:37:53] !log tools kicked grrrit-wm in the diodes. [09:37:57] Logged the message, Master [09:40:17] _joe_: There seems to be something with grrrit-wm itself; its log seems to be full of what looks like json parse errors. [09:40:30] _joe_: ok [09:40:51] It's not very clear though; just a bunch of stack traces. [09:43:29] <_joe_> !log stopping puppet, fiddling with HHVM parameters on mw1114 [09:43:37] Logged the message, Master [09:48:17] _joe_: I've restarted it entirely; it may be happy now. 
[09:48:38] <_joe_> Coren: let's test it :) [09:49:27] (03PS1) 10Giuseppe Lavagetto: hhvm: adjust TC size and ratios [puppet] - 10https://gerrit.wikimedia.org/r/212266 [09:54:25] <_joe_> Coren: thanks :) [10:03:29] (03PS2) 10Giuseppe Lavagetto: hhvm: adjust TC size and ratios [puppet] - 10https://gerrit.wikimedia.org/r/212266 [10:04:49] (03CR) 10Giuseppe Lavagetto: [C: 032] hhvm: adjust TC size and ratios [puppet] - 10https://gerrit.wikimedia.org/r/212266 (owner: 10Giuseppe Lavagetto) [10:10:40] 6operations, 10Traffic, 5Patch-For-Review, 7discovery-system, 5services-tooling: Create an etcd puppet module + find suitable servers for deployment - https://phabricator.wikimedia.org/T97973#1298666 (10Joe) As decided with @akosiaris, etcd will be installed on VMs on the eqiad ganeti cluster for now. [10:10:47] 6operations, 10Traffic, 7discovery-system, 5services-tooling: integrate (pybal|varnish)->varnish backend config/state with etcd or similar - https://phabricator.wikimedia.org/T97029#1298669 (10Joe) [10:10:50] 6operations, 10Traffic, 7discovery-system, 5services-tooling: Create a tool to sync static configuration from a repository to the consistent k/v store - https://phabricator.wikimedia.org/T97978#1298670 (10Joe) [10:18:56] !log jynus Synchronized wmf-config/db-eqiad.php: repool db1063 (duration: 00m 11s) [10:19:01] Logged the message, Master [10:28:59] 6operations, 10Deployment-Systems: Unhashable type: dict error when running salt --batch-size - https://phabricator.wikimedia.org/T99776#1298692 (10ArielGlenn) 3NEW a:3ArielGlenn [10:29:15] 6operations, 10Deployment-Systems: Unhashable type: dict error when running salt --batch-size - https://phabricator.wikimedia.org/T99776#1298700 (10ArielGlenn) p:5Triage>3Normal [10:29:19] 6operations, 10Deployment-Systems: Unhashable type: dict error when running salt --batch-size - https://phabricator.wikimedia.org/T99776#1298692 (10ArielGlenn) Looks similar: https://github.com/saltstack/salt/issues/23047 [10:37:40] (03PS1) 
10Dereckson: New throttle rule: * Event name .... Festival delle Libertà digitali @ Festambiente Vicenza 2015 * Event start ... 2015-06-26T13:00 +0:00 * Event end ..... 2015-06-27T23:59 +0:00 * IP ............ (all) * Projects ...... it.wikivoyage * Attendees ..... 50 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212271 (https://phabricator.wikimedia.org/T99772) [10:38:47] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [10:39:14] 6operations, 10Deployment-Systems: Unhashable type: dict error when running salt --batch-size - https://phabricator.wikimedia.org/T99776#1298707 (10ArielGlenn) 2015-05-20 10:14:40,401 [salt.log.setup ][ERROR ] An un-handled exception was caught by salt's global exception handler: TypeError: unhashable type... [10:40:03] (03PS2) 10Dereckson: Festival delle Libertà digitali throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212271 (https://phabricator.wikimedia.org/T99772) [10:40:44] 6operations, 10Deployment-Systems: Unhashable type: dict error when running salt --batch-size - https://phabricator.wikimedia.org/T99776#1298709 (10ArielGlenn) more log: 2015-05-20 10:14:40,398 [salt.utils.minions][ERROR ] Failed matching available minions with list pattern: [{'mw1190.eqiad.wmnet': {'ret':... [10:49:45] (03PS8) 10Giuseppe Lavagetto: For cert names, use the fqdn instead of the ec2id if use_dnsmasq is lowered. [puppet] - 10https://gerrit.wikimedia.org/r/202924 (owner: 10Andrew Bogott) [10:52:05] 6operations, 5Patch-For-Review, 7database: On a maintenance window, upgrade db1063 to 14.04 and its MariaDB package to 10.0.16 - https://phabricator.wikimedia.org/T99520#1298715 (10jcrespo) The node has been repooled after an upgrade to 14.04 and MariaDB 10.0.16, but the machine was not restarted (it is curr... 
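Note on the "unhashable type" error reported in T99776 above: the log excerpt shows a list pattern whose entries are dicts keyed by minion name rather than plain minion-id strings. A minimal Python sketch of how that shape triggers the error, and one way to normalise it. This is an illustration only, not Salt's code and not the fix ArielGlenn tested; the function name `minion_ids` and the sample data are hypothetical.

```python
def minion_ids(pattern):
    """Collect minion ids from a list pattern.

    Hypothetical helper: entries may be plain id strings, or dicts
    keyed by minion id (the shape seen in the log excerpt).
    """
    ids = set()
    for entry in pattern:
        if isinstance(entry, dict):
            # A dict cannot be a set element, so take its keys instead.
            ids.update(entry.keys())
        else:
            ids.add(entry)
    return ids


# Placeholder data, assumed for illustration only.
bad_pattern = [{"minion-a.example": {"some_key": "some value"}}]

try:
    set(bad_pattern)  # dicts are unhashable, so this raises TypeError
    error = None
except TypeError as exc:
    error = str(exc)

normalised = minion_ids(bad_pattern + ["minion-b.example"])
```

Normalising at the boundary keeps the rest of the matching code working on plain strings, which is the assumption the set-based matching appears to make.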
[10:52:31] 6operations, 5Patch-For-Review, 7database: On a maintenance window, upgrade db1063 to 14.04 and its MariaDB package to 10.0.16 - https://phabricator.wikimedia.org/T99520#1298716 (10jcrespo) p:5Triage>3Low [10:57:46] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [10:58:58] PROBLEM - puppet last run on mw1123 is CRITICAL puppet fail [11:10:52] 6operations: mailman: centralize logging or create a mailman admin group - https://phabricator.wikimedia.org/T99734#1298738 (10fgiunchedi) only slightly related to the general investigation, but looks like the checks on mailman queue should include all queues? (IOW should an alarm for 'too many messages in moder... [11:15:38] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 4 below the confidence bounds [11:18:36] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [11:20:24] <_joe_> mh [11:29:06] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [11:34:54] 7Blocked-on-Operations, 6Commons, 10Wikimedia-Site-requests: Add *.wmflabs.org to `wgCopyUploadsDomains` - https://phabricator.wikimedia.org/T78167#1298782 (10Steinsplitter) >>! In T78167#1297229, @Krenair wrote: > You realise this task has an open blocker, right? Somone schould work on the "blocker".... [11:35:27] 6operations, 6Security-Team: Production cluster can't access labs cluster - https://phabricator.wikimedia.org/T95714#1298783 (10Steinsplitter) p:5Normal>3High [11:35:53] 6operations, 6Security-Team: Production cluster can't access labs cluster - https://phabricator.wikimedia.org/T95714#1198268 (10Steinsplitter) Changing priority to "high". This is blocking quite a lot... 
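Note on the "HTTP 5xx req/min" alerts above: the alert text reports the share of datapoints above a threshold (500.0 for critical, 250.0 in the recovery message). A minimal sketch of that kind of calculation, assuming the check simply counts datapoints over the threshold in a window; the real check's window size and logic are not shown in this log, and `percent_above` is a hypothetical name.

```python
def percent_above(datapoints, threshold):
    """Return the percentage of non-null datapoints strictly above threshold."""
    values = [v for v in datapoints if v is not None]  # Graphite series can contain nulls
    if not values:
        return 0.0
    over = sum(1 for v in values if v > threshold)
    return 100.0 * over / len(values)


# Assumed window of 14 datapoints with a single spike above 500 req/min.
window = [100.0] * 13 + [600.0]
share = round(percent_above(window, 500.0), 2)
```

Under that assumption, a reading of 7.14% would be consistent with one datapoint out of fourteen exceeding the critical threshold.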
[11:39:58] (03CR) 10Physikerwelt: [C: 031] "I tested I767dd2720386221891f25b5c692b766b4708be15 on http://math-preview.wmflabs.org/ using $wgMathMathMLUrl = "http://mathoid2.eqiad.wmf" [puppet] - 10https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: 10Ori.livneh) [11:47:27] RECOVERY - puppet last run on mw1123 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures [12:11:17] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [12:42:30] 6operations, 10Deployment-Systems: Unhashable type: dict error when running salt --batch-size - https://phabricator.wikimedia.org/T99776#1298827 (10ArielGlenn) I have tested a manual fix in deployment-prep which seems ok. I have commented with this fix and the explanation on the upstream bug and will be naggi... [12:59:37] PROBLEM - Disk space on analytics1017 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/b 73495 MB (3% inode=99%): /var/lib/hadoop/data/c 82103 MB (4% inode=99%): /var/lib/hadoop/data/d 80241 MB (4% inode=99%): /var/lib/hadoop/data/e 80410 MB (4% inode=99%): /var/lib/hadoop/data/f 83324 MB (4% inode=99%): /var/lib/hadoop/data/g 83779 MB (4% inode=99%): /var/lib/hadoop/data/h 82996 MB (4% inode=99%): /var/lib/hadoop/data/i [12:59:49] hmmmMMM [13:00:04] aude: Dear anthropoid, the time has come. Please deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150520T1300). 
[13:11:18] (03PS1) 10Springle: repool db1045; depool db1026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212280 [13:11:56] (03CR) 10Springle: [C: 032] repool db1045; depool db1026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212280 (owner: 10Springle) [13:12:02] (03Merged) 10jenkins-bot: repool db1045; depool db1026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212280 (owner: 10Springle) [13:12:04] (03PS1) 10KartikMistry: CX: Enable Content Translation for 20150521 planned wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212281 (https://phabricator.wikimedia.org/T98741) [13:13:07] !log springle Synchronized wmf-config/db-eqiad.php: repool db1045; depool db1026 (duration: 00m 13s) [13:13:12] Logged the message, Master [13:28:33] (03PS1) 10Aude: Enable usage tracking on itwiki and wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212282 (https://phabricator.wikimedia.org/T98303) [13:35:08] RECOVERY - puppet last run on analytics1029 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures [13:35:55] (03CR) 10Springle: [C: 031] Create new module for managing RAID settings [puppet] - 10https://gerrit.wikimedia.org/r/212027 (https://phabricator.wikimedia.org/T84178) (owner: 10Jcrespo) [13:37:43] (03CR) 10Ottomata: Slight refactor for varnishstatus.py (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [13:51:29] (03CR) 10Alexandros Kosiaris: [C: 032] puppetmaster: remove legacy resources [puppet] - 10https://gerrit.wikimedia.org/r/209262 (owner: 10Alexandros Kosiaris) [13:59:04] (03PS1) 10Ottomata: Rotate kafkatee generated logs daily [puppet] - 10https://gerrit.wikimedia.org/r/212285 [14:00:05] chasemp: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150520T1400). 
[14:01:07] PROBLEM - puppet last run on sca1001 is CRITICAL Puppet has 3 failures [14:01:14] (03PS2) 10Ottomata: Rotate kafktee generated log daily and compress [puppet] - 10https://gerrit.wikimedia.org/r/212285 [14:02:16] a bit of an insert spike on s2 shard right now [14:08:20] (03CR) 10Ottomata: [C: 032] Rotate kafktee generated log daily and compress [puppet] - 10https://gerrit.wikimedia.org/r/212285 (owner: 10Ottomata) [14:08:35] (03CR) 10Filippo Giunchedi: Slight refactor for varnishstatus.py (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [14:08:36] akosiaris: ok if i merge you rhting? [14:08:44] puppetmaster: cleanups in gitsync (d4ac02c) [14:09:30] (03PS1) 10Ottomata: Compess kafkatee generated logs in logrotate [puppet] - 10https://gerrit.wikimedia.org/r/212286 [14:09:45] ottomata: yep varnishncsa should be fine [14:09:46] (03CR) 10Ottomata: [C: 032 V: 032] Compess kafkatee generated logs in logrotate [puppet] - 10https://gerrit.wikimedia.org/r/212286 (owner: 10Ottomata) [14:10:44] godog: cool [14:10:49] akosiaris: merged. 
[14:13:37] (03PS1) 10Ottomata: Fix kafkatee logrotate olddir [puppet] - 10https://gerrit.wikimedia.org/r/212288 [14:13:51] (03CR) 10Ottomata: [C: 032 V: 032] Fix kafkatee logrotate olddir [puppet] - 10https://gerrit.wikimedia.org/r/212288 (owner: 10Ottomata) [14:14:38] (03PS1) 10Ottomata: Fix stderr redirection for hdfs balancer log [puppet] - 10https://gerrit.wikimedia.org/r/212289 [14:15:21] (03PS2) 10Ottomata: Fix stderr redirection for hdfs balancer log [puppet] - 10https://gerrit.wikimedia.org/r/212289 [14:15:27] (03CR) 10Ottomata: [C: 032 V: 032] Fix stderr redirection for hdfs balancer log [puppet] - 10https://gerrit.wikimedia.org/r/212289 (owner: 10Ottomata) [14:17:16] (03PS1) 10Ottomata: Send refinery dump emails to Joseph as well [puppet] - 10https://gerrit.wikimedia.org/r/212290 [14:18:07] (03CR) 10Ottomata: [C: 032] Send refinery dump emails to Joseph as well [puppet] - 10https://gerrit.wikimedia.org/r/212290 (owner: 10Ottomata) [14:24:06] ottomata: ah yes, thanks. sorry [14:24:56] !log Deployed Event Logging Server with better batch insertion on Monday, May 18 (apologies for late notice) [14:25:05] Logged the message, Master [14:25:26] np [14:29:58] (03PS2) 10Alexandros Kosiaris: puppetmaster: latest to present [puppet] - 10https://gerrit.wikimedia.org/r/209269 [14:30:00] (03PS2) 10Alexandros Kosiaris: puppetmaster::gitpuppet lint cleanups [puppet] - 10https://gerrit.wikimedia.org/r/209268 [14:30:02] (03PS2) 10Alexandros Kosiaris: puppetmaster: Move backups to the role class [puppet] - 10https://gerrit.wikimedia.org/r/209271 [14:30:04] (03PS2) 10Alexandros Kosiaris: puppetmaster::logstash. Avoid out of module dependencies [puppet] - 10https://gerrit.wikimedia.org/r/209270 [14:30:06] (03PS2) 10Alexandros Kosiaris: puppetmaster::reporter::logstash. 
Remove the reporter namespace [puppet] - 10https://gerrit.wikimedia.org/r/209265 [14:30:08] (03PS2) 10Alexandros Kosiaris: puppetmaster: remove extraneous empty line [puppet] - 10https://gerrit.wikimedia.org/r/209264 [14:30:10] (03PS2) 10Alexandros Kosiaris: puppetmaster::config Avoid out of module dependencies [puppet] - 10https://gerrit.wikimedia.org/r/209267 [14:30:12] (03PS2) 10Alexandros Kosiaris: puppetmaster::config. Minor lints [puppet] - 10https://gerrit.wikimedia.org/r/209266 [14:30:14] (03PS2) 10Alexandros Kosiaris: puppetmaster: DRY on inclusion of hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/209272 [14:35:31] (03CR) 10Aude: [C: 032] Enable usage tracking on itwiki and wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212282 (https://phabricator.wikimedia.org/T98303) (owner: 10Aude) [14:35:36] (03Merged) 10jenkins-bot: Enable usage tracking on itwiki and wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212282 (https://phabricator.wikimedia.org/T98303) (owner: 10Aude) [14:36:43] !log aude Synchronized wmf-config/InitialiseSettings.php: Enable usage tracking on itwiki and wikiquote (duration: 00m 16s) [14:36:48] Logged the message, Master [14:37:55] (03CR) 10Alexandros Kosiaris: [C: 032] puppetmaster::reporter::logstash. Remove the reporter namespace [puppet] - 10https://gerrit.wikimedia.org/r/209265 (owner: 10Alexandros Kosiaris) [14:44:21] (03PS1) 10ArielGlenn: git deploy: don't fetch/checkout/restart on the deployment server [puppet] - 10https://gerrit.wikimedia.org/r/212291 [14:45:14] (03CR) 10jenkins-bot: [V: 04-1] git deploy: don't fetch/checkout/restart on the deployment server [puppet] - 10https://gerrit.wikimedia.org/r/212291 (owner: 10ArielGlenn) [14:47:40] (03PS2) 10ArielGlenn: git deploy: don't fetch/checkout/restart on the deployment server [puppet] - 10https://gerrit.wikimedia.org/r/212291 [14:48:03] come on. 80 characters instead of 79? seriously? 
[14:48:50] (03CR) 10ArielGlenn: "untested so don't blindly commit, please. thank you." [puppet] - 10https://gerrit.wikimedia.org/r/212291 (owner: 10ArielGlenn) [14:49:24] (03CR) 10ArielGlenn: "https://gerrit.wikimedia.org/r/#/c/212291/ maybe like this. see https://phabricator.wikimedia.org/T67549" [puppet] - 10https://gerrit.wikimedia.org/r/201344 (https://phabricator.wikimedia.org/T94754) (owner: 10BryanDavis) [14:50:03] * AndyRussG waves at #wikimedia-operations [14:50:25] <_joe_> AndyRussG: need something? [14:50:49] hi _joe_ ... indeed I do... ;p [14:51:22] _joe_: https://gerrit.wikimedia.org/r/#/c/202925/ , hoping to get this merged and deployed, but it needs a bit of oversight from ops folks :) [14:51:41] _joe_: thanks 4 asking :) [14:52:39] <_joe_> AndyRussG: being this in mediawiki-core, and not in mediawiki-config, I doubt we can even +2 this :) [14:52:46] Basically it's a core change that changes the signature of a method that's overridden in several extensions [14:53:04] _joe_: well, basically I was hoping for some direction on a deploy strategy [14:53:35] My impression it that we should make sure everything is +2'ed at once, so that first on the beta cluster the method sig and its overrides are all coordinated [14:53:58] <_joe_> AndyRussG: AFAIK we use multiversion everywhere [14:54:06] And then also do a special deploy to roll the core and extension patches all at once [14:54:20] <_joe_> oh you mean for the extensions [14:54:36] <_joe_> I don't really know how we do manage extensions deployments [14:54:36] _joe_: yea. The extensions override the method that changes [14:54:52] I think it varies by extension [14:54:59] <_joe_> you should ask the release team for that! [14:55:06] looks scary [14:55:09] <_joe_> but they're at an offsite pre-hackathon [14:55:17] <_joe_> yeah looks pretty scary, but manageable [14:55:18] hi aude :) [14:55:40] <_joe_> AndyRussG: so sorry, can't really help here. 
[14:55:49] I'm not sure it would actually break anything, but it might fill up the log with ugly warnings [14:55:51] * aude would probably ask krinkle [14:56:06] but doubt he is around today [14:56:28] aude: he does appear to be in his un-around state ;p [14:56:32] anyone else you can think of? [14:56:36] no [14:56:37] Good morning. I've just added 3 config patches for the morning SWAT. Sorry for the last minute addition. [14:57:10] Yeah I guess some folks from release would also want to opine [14:57:24] aude: hmm K thanks in any case :) [14:57:26] godog: did you have a suggested graphite/statsd prefix for this reqstats stuff? [14:57:32] sure. good luck :) [14:57:50] thx.... [14:57:58] <_joe_> AndyRussG: definitely sync with core devs and the release team [14:58:19] _joe_: yep ... :) [14:59:52] ottomata: if we're going the diamond route it'll be under servers. so perhaps just varnish. ? [15:00:04] manybubbles, anomie, ^d, thcipriani, marktraceur, matt_flaschen: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150520T1500). Please do the needful. [15:00:41] AndyRussG, should be fine if you just scap? May want to do it in a window. [15:00:43] well, i thikn we want things like cache type, ja? also, ummm [15:00:49] maybe request.? [15:00:50] so [15:01:29] varnish...requests.? [15:01:29] so [15:01:32] matt_flaschen: thanks! hmm I've never scap'ed, only sync-dir'd [15:02:02] servers.cp1052.varnish.text.frontend.requests.5xx [15:02:02] ? [15:02:49] matt_flaschen: I guess my first concern is the beta cluster when it gets +2'd... Maybe I need to find out who is responsible for coding and deploying on all those extensions, and set a proposed timetable for it all [15:03:38] <_joe_> ottomata: that sounds correct, although I'd add the datacenter somewhere [15:03:52] ook, before cache type then [15:04:03] <_joe_> so servers.cp1052.varnish.eqiad.text... 
[15:04:07] <_joe_> something like this [15:04:09] AndyRussG, I think the Beta cluster pulls every x minutes (someone should make sure https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/How_code_is_updated#How_do_I_get_my_code_on_the_beta_cluster.3F is updated). [15:04:20] <_joe_> ottomata: but wait for filippo [15:04:34] <_joe_> he surely has a firmer grasp on our current graphite installation [15:04:39] AndyRussG, anyway, you wouldn't be the first one to break Beta for a few minutes. Does it actually fatal, or just warn, if there's a mismatch? [15:04:41] aye he will review :) [15:05:12] matt_flaschen: I think it depends on config, truth is I'm not sure how it'll go [15:05:45] WRT beta, I was wondering if that's the case for all extensions [15:06:10] !log db1045 pt-osc reindexing (should be low load, ~2hr) [15:06:17] Logged the message, Master [15:06:35] AndyRussG, if you merge the extension changes first, it should work fine since it's an optional parameter, right? [15:06:43] Then let that roll out to Beta, then merge core. [15:07:27] Probably can be made non-optional after core change is merged. [15:07:44] AndyRussG, you probably don't even need a window unless you need this feature out Real Soon Now. Otherwise, you can wait for branch cut. [15:07:46] matt_flaschen: hmmm yeah that makes sense... though even in that case there may be some warnings due to just the definition of an overridden method with a different sig than the original [15:08:24] matt_flaschen: by a window, do you mean booking a deploy window on the wikitech deployments page? [15:08:35] Yes [15:09:25] hmmm I didn't know it was OK to go and deploy stuff w/out one... [15:09:35] uhh - anyone swating now? [15:09:39] I haven't been paying attention [15:09:58] manybubbles: not as far as I know [15:10:11] Dereckson: ok then. I guess I'll do it! [15:10:11] ottomata: do different types live on the same machine? 
[15:10:33] I haven't been following the emails about reorganizing it - so I'm just going to do what I normally do. like from the wiki page. any objections? [15:10:47] matt_flaschen: around to check your swats? [15:10:51] Yep [15:10:59] hmmm! [15:11:14] AndyRussG, I wasn't saying to do a separate deploy without the window. [15:11:22] I was saying that it might work to use the deployment train. [15:11:31] ah right understood ;p [15:12:05] godog: not that I know of, but it would be nice to aggregate by type, no? [15:12:26] AndyRussG, I think it is fine to override it if the number of required parameters is the same: [15:12:30] (03CR) 10Manybubbles: [C: 032] Logo configuration on ur.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211514 (https://phabricator.wikimedia.org/T97510) (owner: 10Dereckson) [15:12:38] (03Merged) 10jenkins-bot: Logo configuration on ur.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211514 (https://phabricator.wikimedia.org/T97510) (owner: 10Dereckson) [15:12:40] "For example, if the child class defines an optional argument, where the abstract method's signature does not, there is no conflict in the signature. This also applies to constructors as of PHP 5.4. Before 5.4 constructor signatures could differ." [15:12:42] https://php.net/manual/en/language.oop5.abstract.php [15:12:45] matt_flaschen: yeah that makes sense. Deployment train would get it there soon enough. I guess it's just a question of making sure all all the needed extensions are updated on that train [15:13:17] AndyRussG, yes. Do you know how to grep through all extensions on the cluster? [15:13:28] cmjohnson1, ping [15:13:50] ottomata: yep it'd be nice, let's start with type and name and see what happens [15:13:51] !log manybubbles Synchronized w/static/images/project-logos/urwikiquote.png: SWAT update urwikiquote logo 1/2 (duration: 00m 13s) [15:13:53] Testing. 
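Note on the metric naming discussed above (ottomata's `servers.cp1052.varnish.text.frontend.requests.5xx`, with _joe_'s suggestion to place the datacenter before the cache type): a minimal sketch of building such a dotted Graphite path. The function name and parameter names are hypothetical, and the final prefix was still pending godog's review at this point in the log.

```python
def metric_path(host, cache_type, layer, status_class, datacenter=None):
    """Build a dotted Graphite path for per-host Varnish request counts.

    Layout assumed from the discussion:
    servers.<host>.varnish.[<datacenter>.]<cache_type>.<layer>.requests.<status_class>
    """
    parts = ["servers", host, "varnish"]
    if datacenter:
        parts.append(datacenter)  # placed before the cache type, as suggested
    parts += [cache_type, layer, "requests", status_class]
    for part in parts:
        # A dot or space inside a component would silently add or break a level.
        if not part or "." in part or " " in part:
            raise ValueError("invalid path component: %r" % (part,))
    return ".".join(parts)


without_dc = metric_path("cp1052", "text", "frontend", "5xx")
with_dc = metric_path("cp1052", "text", "frontend", "5xx", datacenter="eqiad")
```

Keeping the cache type as its own path level is what would allow aggregating by type with a wildcard on the host level, which is the use case ottomata raises.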
[15:13:58] Logged the message, Master [15:14:00] oh well 1/2 [15:14:06] matt_flaschen: directly on the cluster? Mmm what I've done is pulled in all the submodules on a local core repository, on some deploy branches, and grepped locally [15:14:11] <_joe_> AndyRussG: I think you make things a bit too easy [15:14:17] AndyRussG, no, not directly. [15:14:22] That's what I meant. [15:14:30] <_joe_> first of all I don't see anyone who gave even +1 to your patch [15:14:36] <_joe_> so don't rush to the deploy [15:14:39] !log manybubbles Synchronized wmf-config/InitialiseSettings.php: SWAT update urwikiquote logo 2/2 (duration: 00m 11s) [15:14:46] Logged the message, Master [15:14:57] <_joe_> I suppose such a breaking change will create some debate [15:15:16] Works. [15:15:19] matt_flaschen: ah yes :) did that. I did hear that that'll get all the extensions we have on the cluster [15:15:57] _joe_, he's not trying to deploy it now, just discussing issues. Sounds like deployment train would be fine, so wouldn't even need a separate window. [15:16:18] _joe_: Krinkle|detached signed off on the change on principle in one of the early patchsets. The extensive commit message was on ori's suggestion and thedj recommended the change to in release notes [15:16:48] (03CR) 10Manybubbles: "Done. Looks good." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/211514 (https://phabricator.wikimedia.org/T97510) (owner: 10Dereckson) [15:17:52] (03CR) 10Manybubbles: [C: 032] Festival delle Libertà digitali throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212271 (https://phabricator.wikimedia.org/T99772) (owner: 10Dereckson) [15:17:59] (03Merged) 10jenkins-bot: Festival delle Libertà digitali throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212271 (https://phabricator.wikimedia.org/T99772) (owner: 10Dereckson) [15:18:05] matt_flaschen: _joe_: There's also a phab task BTW: https://phabricator.wikimedia.org/T98924 [15:18:14] Ah hmm missing bug field in commit message [15:18:55] !log manybubbles Synchronized wmf-config/throttle.php: SWAT clean old throttle rule and add a new one for an upcoming festival (duration: 00m 13s) [15:19:01] Logged the message, Master [15:19:16] Dereckson: ^^^ I'm not sure there is a way to test that one. I already tested the wikiquote log change. [15:19:27] matt_flaschen: thx for that tidbit of phpdoc, that's good to see [15:19:35] I concour, throttle rules changes aren't really testable. [15:20:14] (03CR) 10Manybubbles: [C: 032] Namespace configuration on pt.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211517 (https://phabricator.wikimedia.org/T94894) (owner: 10Dereckson) [15:20:22] (03Merged) 10jenkins-bot: Namespace configuration on pt.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211517 (https://phabricator.wikimedia.org/T94894) (owner: 10Dereckson) [15:21:58] !log manybubbles Synchronized wmf-config/InitialiseSettings.php: SWAT new namespaces for ptwikinews (duration: 00m 11s) [15:22:00] Dereckson: last one ^^^^ [15:22:04] Logged the message, Master [15:22:06] matt_flaschen: _joe_: how does this plan sound? 1) make sure enough core devs and implicated extension devs are reviewers/subscribed to the changes and the phabricator task. 
2) Set a proposed date for merging everything, ask for any objections and final comments before that date. 3) Talk to release engineering to make sure it all gets on the same train [15:22:09] How does that sound? [15:22:18] manybubbles: Works too. [15:22:31] Dereckson: swat over for you. good times [15:22:38] matt_flaschen: merge time for your patches [15:22:52] Thanks manybubbles for the deploy. [15:24:10] AndyRussG, it sounds fine. Probably more cautious than it needs to be given the patch has already got a pretty wide viewing and +1 from Krinkle. You do need to make sure you caught all the WMF-deployed extensions, of course. [15:24:29] AndyRussG, also, if it is a breaking change for extensions that still have the no-arg override, it should be documented as such in the release notes. [15:26:08] hmmm matt_flaschen: currently in rel notes: "* Added an optional ResouceLoaderContext parameter to ResourceLoaderModule::getDependencies(). Extension classes that override that method should be updated." [15:26:38] AndyRussG, if it will actually break extensions that don't update, it should be marked BREAKING CHANGE. [15:27:02] matt_flaschen: hmm, so I guess, find out if it's really breaking or just warning-generation level of breaking [15:27:20] or is it that considered just as breaking as anything else? [15:27:27] RECOVERY - Disk space on analytics1017 is OK: DISK OK [15:28:10] Probably don't need BREAKING CHANGE if it just triggers a warning. [15:28:24] (03PS5) 10Mobrovac: Enable graphoid in labs & production [puppet] - 10https://gerrit.wikimedia.org/r/211758 (owner: 10GWicke) [15:28:31] (03CR) 10Mobrovac: [C: 031] Enable graphoid in labs & production [puppet] - 10https://gerrit.wikimedia.org/r/211758 (owner: 10GWicke) [15:29:32] matt_flaschen: K I'll find that out then...! Thanks much BTW :D [15:29:40] No problem [15:29:49] _joe_: godog: would you mind reviewing/deploying https://gerrit.wikimedia.org/r/211758 [15:30:18] <_joe_> mobrovac: on it. 
[15:30:19] no Cassandra changes this time [15:30:21] <_joe_> mobrovac: not right now [15:30:21] _joe_: thnx [15:30:37] <_joe_> mobrovac: do you need this /now/? [15:30:57] * _joe_ has just realized it's 5:30 PM [15:31:17] _joe_: preferably, but tomorrow morning might work as well [15:31:45] mobrovac _joe_ I'll take a look [15:31:54] <_joe_> mobrovac: that's a safer bet for me [15:32:02] <_joe_> godog: ok if you have time now, that's better [15:32:03] :) [15:32:07] sure [15:32:18] thnx godog [15:33:36] matt_flaschen: about to sync your flow patch. ready? [15:33:43] also, does it have i18n? [15:34:16] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Enable graphoid in labs & production [puppet] - 10https://gerrit.wikimedia.org/r/211758 (owner: 10GWicke) [15:34:35] mobrovac: merged [15:34:40] manybubbles, yes, and no. [15:34:42] yup, thnx godog [15:34:49] sync-dir is fine [15:35:14] mobrovac: I'm running puppet manually on restbase1001 btw [15:35:20] oh [15:35:28] ^C [15:35:29] :) [15:35:43] is stopped my run [15:35:45] !log manybubbles Synchronized php-1.26wmf6/extensions/Flow/: SWAT update flow for wmf6 to fix two issues (duration: 00m 12s) [15:35:48] s/is/i [15:35:50] Logged the message, Master [15:35:54] matt_flaschen: ^^^^^^ [15:35:54] heheh, well yeah it just deployed the new config and that's it [15:36:18] Thanks [15:37:03] matt_flaschen: let me know if that works and then I'll do the wmf5 [15:37:07] RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures [15:38:48] 6operations, 10Wikimedia-DNS, 10Wikimedia-Site-requests, 5Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1299128 (10Dzahn) now added in Apache, next would be installing the wiki http://cn.wikimedia.org/ [15:38:57] PROBLEM - puppet last run on restbase1001 is CRITICAL puppet fail [15:39:06] godog: euh ? ^^ [15:39:06] manybubbles, I can't really test it immediately. 
If someone had staff rights, they could visit https://www.mediawiki.org/wiki/Special:EnableFlow , but that still doesn't test much without actually submitting. [15:39:35] mobrovac: mh probably an artifact, it ran fine, I'm re-running it now [15:39:39] matt_flaschen: k. I'll just merge the next one - can you test it then? [15:39:44] yeah it'll recover mobrovac [15:39:51] 6operations, 10Wikimedia-DNS, 10Wikimedia-Site-requests, 5Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1299133 (10Dzahn) [15:40:14] manybubbles, not right away, but we'll probably use it later today. [15:40:14] <_joe_> I had one comment on that patch but I can easily submit a refactoring patch later [15:40:22] matt_flaschen: k. [15:40:47] RECOVERY - puppet last run on restbase1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:40:50] Deskana, can you visit https://www.mediawiki.org/wiki/Special:EnableFlow just to make sure it doesn't explode immediately (there's no reason it should, just a spot check)? [15:40:53] (03CR) 10Dzahn: [C: 031] Imported logo for Wikimedia User Group China [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211094 (https://phabricator.wikimedia.org/T98676) (owner: 10Dereckson) [15:41:12] It should show a form. [15:41:25] matt_flaschen: Looks fine to me. Do you want me to put anything in the fields? [15:41:26] 6operations, 6Phabricator, 7database: Phabricator database access for Joel Aufrecht - https://phabricator.wikimedia.org/T99295#1299138 (10ksmith) I could take a guess of what tables are needed, although I would probably be wrong, so we might need several iterations. The main ones would be Projects, Tasks, Co... [15:41:47] Deskana, no, that's okay. We'll use it for something real later today most likely.
[15:42:01] godog: damn, do not restart RB on the nodes, please [15:42:38] (03CR) 10Dzahn: [C: 031] cn.wikimedia.org initial configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211103 (https://phabricator.wikimedia.org/T98676) (owner: 10Dereckson) [15:42:49] mobrovac: sure, I won't touch it [15:42:52] godog: ah no, it's ok, was paranoid [15:43:00] godog: puppet ran on all of them? [15:43:41] mobrovac: no not yet, going to test on restbase1001 first [15:44:01] godog: test? what? can i restart RB there? the config is in place [15:44:18] mobrovac: I said that wrong, I didn't stop puppet on the other nodes, yes let's restart rb there [15:44:21] nice, i was looking for the docs to install new wikis and "List of things that will break if you try to install MediaWiki without following this procedure" :) [15:45:20] godog: rb1001 restarted and alive [15:45:41] " Running apt-get -y unattended without the --no-remove option: due to some subtle error in the repository or sources.list, apt-get declares your entire installation "conflicting", right down to glibc, and removes it, bricking the server." :p [15:46:02] where is this from? [15:46:25] Krenair: https://wikitech.wikimedia.org/wiki/Add_a_server [15:47:01] "on the ssh bastion host (fenari)" sigh [15:47:26] (03PS1) 10BryanDavis: [WIP] Add Vagrantfile [puppet] - 10https://gerrit.wikimedia.org/r/212294 [15:47:27] mobrovac: yup, I'll move on to the other [15:47:39] !log restbase restarted on restbase1001 [15:47:44] Logged the message, Master [15:47:46] Krenair: i wanted "Add a wiki" instead .. and yes @ fenari :p [15:49:10] oh, and the very first thing needs to be checked. so that's good [15:49:22] sanitarium.. ah ! "Tell the Ops list, springle, or Coren that this is happening so the sanitarium can be prepared." 
[15:50:07] would be better to explain what needs to be done in sanitarium to prepare [15:50:13] true [15:50:48] 6operations, 6Phabricator, 7database: Phabricator database access for Joel Aufrecht - https://phabricator.wikimedia.org/T99295#1299166 (10mmodell) Maybe @csteipp has an opinion, I believe #security sensitive tasks are the only data that need to be redacted? A read-only db user might be easier, as long as th... [15:51:41] (03PS3) 10Giuseppe Lavagetto: confd: create module [puppet] - 10https://gerrit.wikimedia.org/r/208399 (https://phabricator.wikimedia.org/T97974) [15:51:59] !log restbase restarted on restbase1002 [15:52:04] Logged the message, Master [15:52:10] 6operations, 10Wikimedia-DNS, 10Wikimedia-Site-requests, 5Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1274358 (10Dzahn) [[ https://wikitech.wikimedia.org/wiki/Add_a_wiki | Add a wiki ]] says: "Tell the Ops list, springle, or Coren that t... [15:52:13] godog: ^^ rb1002 ok [15:52:52] mobrovac: yup, I can finish off the rest (ah ah) too [15:53:04] _joe_: yey for confd module [15:53:20] godog: you mean the restarts as well? [15:53:22] sure [15:53:26] yep [15:53:29] thnx [15:54:37] !log rolling restart restbase on restbase1003-1006 [15:54:45] Logged the message, Master [15:55:44] Coren: do you know how to prepare sanitarium for a new wiki? the docs say to tell you or springle or the ops list [15:56:24] mobrovac: how does rb know when do spawn new workers btw? [15:56:41] <_joe_> mobrovac: confd is a bit crude tbh, I know I can write something definitely better, given time to [15:57:00] * _joe_ off for now [15:57:06] godog: when they die or the master proc kills them due to too much memory [15:58:02] mobrovac: ah, what about the initial ramp up after start up? 
[15:58:12] godog: there is now support for sending SIGHUP to the master proc, and it restarts the workers in a rolling fashion [15:58:12] * gwicke cheers about seeing http://en.wikipedia.org/api/rest_v1/?doc#!/Page_content/page_graph_png__title___revision___graph_id__get [15:59:08] mobrovac: nice, btw I've been using service stop/start [15:59:14] godog: on start-up, they're brought up one by one sequentially [15:59:32] gwicke: nice [16:01:15] godog, re initial rampup: I wrote a script that a) waits for the port to become available, and b) then waits another 10 seconds to give LVS enough time to reconnect [16:02:04] the restart playbook in https://github.com/gwicke/ansible-playground/tree/master/roles/restbase [16:03:46] (03PS2) 10Dzahn: labs_vmbuilder: sort files/postinst.copy [puppet] - 10https://gerrit.wikimedia.org/r/210025 (owner: 10Hashar) [16:03:55] !log manybubbles Synchronized php-1.26wmf5/extensions/Flow/: SWAT update flow for wmf5 to fix two issues (duration: 00m 14s) [16:03:58] matt_flaschen: ^^^^^^ [16:04:01] Logged the message, Master [16:04:06] Thanks [16:04:34] sorry for the wait [16:05:00] No problem at all [16:05:03] (03Abandoned) 10BryanDavis: Trebuchet: run all state changing git commands with umask 002 [puppet] - 10https://gerrit.wikimedia.org/r/201344 (https://phabricator.wikimedia.org/T94754) (owner: 10BryanDavis) [16:05:24] (03PS3) 10BryanDavis: git deploy: don't fetch/checkout/restart on the deployment server [puppet] - 10https://gerrit.wikimedia.org/r/212291 (https://phabricator.wikimedia.org/T67549) (owner: 10ArielGlenn) [16:07:41] (03CR) 10BryanDavis: "I haven't tried to run it but the code looks reasonable. We could cherry-pick to deployment-salt for some live testing." 
[puppet] - 10https://gerrit.wikimedia.org/r/212291 (https://phabricator.wikimedia.org/T67549) (owner: 10ArielGlenn) [16:08:18] (03PS5) 10Dzahn: Redirect dev.wikimedia.org URLs [puppet] - 10https://gerrit.wikimedia.org/r/199182 (https://phabricator.wikimedia.org/T372) (owner: 10Spage) [16:09:25] (03CR) 10Dzahn: [C: 032] Redirect dev.wikimedia.org URLs [puppet] - 10https://gerrit.wikimedia.org/r/199182 (https://phabricator.wikimedia.org/T372) (owner: 10Spage) [16:10:46] (03CR) 10Dzahn: [C: 032] monitoring.pp - indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/212233 (owner: 10Dzahn) [16:12:06] JohnLewis: hello [16:12:19] mutante: hello [16:12:50] JohnLewis: fyi, uploaded that script i used for "remove user X from all lists, but only the private ones" [16:13:18] and then there's a patch from matanya too [16:13:20] mutante: saw the gerrit patch. don't know if you merged it or not so checking gerrit now as there's other stuff [16:13:48] that one from matanya is simply making the service restart on config changes [16:14:43] JohnLewis: i'll add some puppet code to have that script deployed on sodium [16:15:29] (03CR) 10John F. Lewis: [C: 031] "sane and follows the correct process for handling mailman subscription removes via shell scripts." [puppet] - 10https://gerrit.wikimedia.org/r/212235 (owner: 10Dzahn) [16:15:34] :) [16:15:56] (03CR) 10Dzahn: [C: 032] mailman: script to remove user from private lists [puppet] - 10https://gerrit.wikimedia.org/r/212235 (owner: 10Dzahn) [16:17:07] (03CR) 10Dzahn: [C: 032] fundraising.pp: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/212234 (owner: 10Dzahn) [16:20:13] (03CR) 10John F. Lewis: [C: 031] "Ideally deploys should be accompanied with manual restarts as opposed to automated ones. 
If a service is auto restarted I fear ops will de" [puppet] - 10https://gerrit.wikimedia.org/r/212260 (owner: 10Matanya) [16:20:22] mutante ^ that's what I call an essay for a code review :p [16:21:10] JohnLewis: nice:) and .. i agree with the whole thing pretty much [16:21:33] was it really a change in mm_cfg.py though? [16:21:40] Yes. [16:21:42] ok [16:22:07] mailman is awkward in that it doesn't fail directly if the config fails or runs a cached config if it fails (like icinga) [16:22:15] it just runs and causes errors down the line [16:23:00] icinga would also fail if we just restarted it without checking first, was like that in the past [16:23:10] then code was added to always do that [16:23:19] icinga is vocal though [16:23:25] right [16:23:35] by default it runs cached, but can be forced to fail as opposed to caching [16:24:14] godog: still there? [16:24:20] (03CR) 10Dzahn: [C: 031] "i agree with John here about the pros and cons this has. +1 for now" [puppet] - 10https://gerrit.wikimedia.org/r/212260 (owner: 10Matanya) [16:25:18] ottomata: sure [16:25:44] godog: so [16:25:49] as I am writing this thing [16:26:00] i am doing +1 for every dimension and storing in a dict [16:26:07] e.g. [16:26:23] for line in varnishncsa: [16:26:23] counts[http_status] += 1 [16:26:35] mutante: I'm now looking into more things for mailman, mostly monitoring as godog suggested a fairly good idea regarding queues [16:26:36] but, this will run for a really long time [16:26:43] probably will overflow. [16:26:56] should I be sending gauge? [16:27:01] rather than raw count seen? [16:27:12] JohnLewis: actually.. i thought i already did that [16:27:16] that is, should I reset the count each time collector runs?
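The per-dimension counting ottomata describes can be sketched roughly as below. This is only an illustration, not the collector from the Gerrit change: the log format (status code as the first whitespace-separated field) and the names count_statuses and per_second_rate are assumptions made up for the example. It keeps a running total that is never reset and shows how a rate could be derived from two successive readings. On the overflow worry, Python integers are arbitrary precision, so the dict values themselves do not wrap.

```python
from collections import Counter


def count_statuses(lines, counts=None):
    """Add one to the running total for each HTTP status seen.

    Assumes (hypothetically) that the status code is the first
    whitespace-separated field of every log line.
    """
    counts = Counter() if counts is None else counts
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts


def per_second_rate(previous_total, current_total, interval_seconds):
    """Derive a rate from two readings of a counter that is never reset."""
    return (current_total - previous_total) / interval_seconds


# First collection interval.
counts = count_statuses(["200 GET /a", "404 GET /b", "200 GET /c"])
before = counts["200"]

# Second interval: the same Counter keeps accumulating.
counts = count_statuses(["200 GET /d"], counts)
after = counts["200"]

print(before, after, per_second_rate(before, after, 10))
```

Keeping the raw total and differencing it later means a missed collection run loses resolution but not events, which is the usual argument for counters over per-interval gauges.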
[16:27:30] mutante: consider it a review of it all and any misc improvements then :) [16:28:00] JohnLewis: right, "OK: mailman queues are below 42 " [16:28:15] I feel it may be a blackspot with mailman itself though [16:28:44] ottomata: I think not resetting the counter is what we want, and then calculate the rate on graphite side. the other option of course is to reset it but then it becomes a rate based on collector interval [16:29:12] JohnLewis: ok, please check, the file is in files/icinga/check_mailman_queue [16:29:39] JohnLewis: yep what mutante said, I don't think we're monitoring all queues tho? [16:30:11] 13 FILES="$mailman_base/in $mailman_base/out $mailman_base/virgin $mailman_base/shunt" [16:30:14] ^ [16:30:19] godog: true, but what about overflow? [16:30:29] i guess it will likely not happen, 64bit... [16:30:55] mutante: bounces is also one [16:31:02] or maybe python is just smart about it [16:31:03] JohnLewis: https://gerrit.wikimedia.org/r/#/c/199662/ [16:31:07] ottomata: if it rolls over it isn't any different than restarting diamond, we'd have to handle it anyway [16:37:34] godog: i'll take a look in a bit [16:38:01] kk [16:39:05] (03PS2) 10Jcrespo: Create new module for managing RAID settings [puppet] - 10https://gerrit.wikimedia.org/r/212027 (https://phabricator.wikimedia.org/T84178) [16:39:09] (03PS1) 10Jcrespo: Moving pdns and designate databases from m1 to m5 [puppet] - 10https://gerrit.wikimedia.org/r/212302 (https://phabricator.wikimedia.org/T92693) [16:39:17] (03PS1) 10Dzahn: mailman: class for helper scripts [puppet] - 10https://gerrit.wikimedia.org/r/212303 [16:40:43] I'm having greg move the phabricator deployment window because the current slot is difficult for me to be awake for. Is this a good time to catch opsen around in case something goes wrong? 
10:30-11:00pm PST / 6:30-7:00am UTC [16:41:17] (03PS5) 10Ottomata: Add diamond collector varnishstats.py [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) [16:41:27] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce base::puppet::config definition [puppet] - 10https://gerrit.wikimedia.org/r/212300 (owner: 10Alexandros Kosiaris) [16:41:35] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting stat1002/1003 access for sniedzielski - https://phabricator.wikimedia.org/T97866#1299275 (10Niedzielski) Hey @Dzahn. I apologize but this doesn't seem to be working for me at the moment and I'm not sure if it's on my end or the server. I can l... [16:41:57] (03CR) 10jenkins-bot: [V: 04-1] Add diamond collector varnishstats.py [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [16:42:14] twentyafterfour: I think so, europe should be ~awake and sometimes US too [16:42:26] (03PS2) 10Dzahn: mailman: class for helper scripts [puppet] - 10https://gerrit.wikimedia.org/r/212303 [16:42:44] 10:30pdt is only 6:30 CET, that seems pretty early to expect europe to be online [16:43:17] <_joe_> 7 AM UTC is ok-ish when we're at UTC+2 [16:43:41] <_joe_> at UTC+1 (winter) it's going to be difficult fo find someone online [16:43:44] <_joe_> maybe me [16:43:54] (03PS6) 10Ottomata: Add diamond collector varnishstats.py [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) [16:43:56] the other option is putting it in the evening (US) which is late in europe [16:44:26] or doing it during business hours in california which is what I'm trying to avoid [16:44:27] godog: https://gerrit.wikimedia.org/r/#/c/212041/6/modules/diamond/files/collector/varnishstats.py [16:44:30] (03CR) 10jenkins-bot: [V: 04-1] Add diamond collector varnishstats.py [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) (owner: 
10Ottomata) [16:44:36] ok I think I misunderstood 'around' from 'online' [16:45:40] I should be able to handle anything that goes wrong, really. I have root on phabricator machine and I have extensive sysadmin experience ... it's more of a "just in case" thing [16:45:44] (03CR) 10John F. Lewis: [C: 031] "new classes are good" [puppet] - 10https://gerrit.wikimedia.org/r/212303 (owner: 10Dzahn) [16:46:02] <_joe_> twentyafterfour: oh for phabricator you mean [16:46:34] yeah [16:47:14] <_joe_> well, what about PST afternoon then? [16:47:16] (03PS7) 10Ottomata: Add diamond collector varnishstats.py [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) [16:47:23] <_joe_> you'll have opsens around [16:47:52] (03PS3) 10Alexandros Kosiaris: puppetmaster: Move backups to the role class [puppet] - 10https://gerrit.wikimedia.org/r/209271 [16:47:54] (03PS3) 10Alexandros Kosiaris: puppetmaster::logstash. Fix out of module dependencies [puppet] - 10https://gerrit.wikimedia.org/r/209270 [16:47:56] (03PS3) 10Alexandros Kosiaris: puppetmaster: DRY on inclusion of hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/209272 [16:47:58] (03PS1) 10Alexandros Kosiaris: Remove the unneeded priorites in filenames [puppet] - 10https://gerrit.wikimedia.org/r/212305 [16:48:00] (03CR) 10jenkins-bot: [V: 04-1] Add diamond collector varnishstats.py [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [16:48:04] <_joe_> I think 1 PM PDT/PST translates to 10 PM in europe and 4 PM on the east coast [16:49:44] (03CR) 10Dzahn: [C: 032] mailman: class for helper scripts [puppet] - 10https://gerrit.wikimedia.org/r/212303 (owner: 10Dzahn) [16:51:23] _joe_: that is during the west-coast workday which I was trying to avoid. 
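The time-zone arithmetic being traded back and forth above can be checked with a short sketch. It uses fixed UTC offsets (PST -8, PDT -7, EDT -4, CEST +2) rather than a time-zone database, and the date 2015-05-20 is an assumption taken from the log's own timestamps; the convert helper is a made-up name for this example.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets, assumed for illustration (no DST rules applied).
PST = timezone(timedelta(hours=-8), "PST")
PDT = timezone(timedelta(hours=-7), "PDT")
EDT = timezone(timedelta(hours=-4), "EDT")
CEST = timezone(timedelta(hours=2), "CEST")
UTC = timezone.utc


def convert(hour, minute, source, targets, day=datetime(2015, 5, 20)):
    """Return {zone name: 'MM-DD HH:MM'} for a local time in `source`."""
    local = day.replace(hour=hour, minute=minute, tzinfo=source)
    return {
        tz.tzname(None): local.astimezone(tz).strftime("%m-%d %H:%M")
        for tz in targets
    }


# 1 PM Pacific daylight time, as _joe_ suggests.
print(convert(13, 0, PDT, [CEST, EDT]))
# 10:30 PM read as PST versus PDT gives different UTC times.
print(convert(22, 30, PST, [UTC]))
print(convert(22, 30, PDT, [UTC, CEST]))
```

With these offsets, 1 PM PDT lands at 22:00 CEST and 16:00 EDT, matching _joe_'s estimate. The "10:30pm PST / 6:30am UTC" pairing only holds for standard time; read as PDT, 10:30 PM is 05:30 UTC and 07:30 CEST.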
[16:51:34] (03PS8) 10Ottomata: Add diamond collector varnishstats.py [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) [16:51:34] 6operations: Requesting addition to researchers group on stat1003 - https://phabricator.wikimedia.org/T99798#1299294 (10Dbrant) 3NEW [16:52:22] (03CR) 10jenkins-bot: [V: 04-1] Add diamond collector varnishstats.py [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [16:55:18] <_joe_> twentyafterfour: why exactly? Do we have statistics suggesting phab is used more during that time period? [16:55:32] I have no data ;) [16:55:32] <_joe_> twentyafterfour: I won't be so sure about that, without data [16:55:47] the downtime should be <15 minutes anyway [16:57:33] _joe_, I can help with db stats for phab :-) [16:58:21] jynus: https://phabricator.wikimedia.org/T99295 [16:59:25] mutante, I can help with that too, but it will take a bit to filter private stuff [17:00:08] jynus: cool, thanks! how about the "read-only db user might be easier" [17:01:01] if the user has an NDA, maybe no need to filter private stuff [17:03:02] mutante, not really- the complicated thing is to not provide access to certain tickets (filtering by rows) [17:03:43] if it is just a question of not provifing access to real private things (passwords, users), table filtering, it would be easier [17:04:20] jynus: ah, ok, gotcha [17:05:00] jynus: if the user hasn't signed an NDA, NDA stuff will also have to be removed. 
though if he has a signed NDA - nda and security stuff can be provided if Chris says okay [17:05:06] (03PS9) 10Ottomata: Add diamond collector varnishstats.py [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) [17:05:14] so best check the social factors before wasting time or having ot do things twice :) [17:05:39] (03PS10) 10Ottomata: Add varnish request stats diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) [17:05:42] JohnLewis, that is the point - sorry if I expressed poorly [17:06:16] jynus: probably me misreading stuff [17:06:18] (03CR) 10jenkins-bot: [V: 04-1] Add varnish request stats diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [17:06:26] phabricator doesn't store private user data or passwords, currently [17:06:48] jenkins is never pep8 happy [17:07:02] jynus: sorry I missed your ping earlier [17:07:15] jynus: one other question? do you know about sanitarium already? [17:07:41] mutante, I saw it but I didn't ping because it is on my TODO list [17:07:48] because of other ticket [17:08:03] but with now I would be a blocker [17:08:16] jynus: *nod*, i saw some docs that say when creating a new wiki sanitarium needs to be prepared for it .. 
somehow [17:08:18] ask again in a few days [17:08:29] jynus: thank you [17:08:40] sorry, cmjohnson1 [17:08:54] I pm you, if you hace some time now [17:09:13] sure [17:10:43] (03PS11) 10Ottomata: Add varnish request stats diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) [17:11:23] (03CR) 10jenkins-bot: [V: 04-1] Add varnish request stats diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [17:12:00] (03CR) 10Dzahn: [C: 032] mailman: restart mailman after config changes [puppet] - 10https://gerrit.wikimedia.org/r/212260 (owner: 10Matanya) [17:13:29] 6operations, 10Wikimedia-DNS, 10Wikimedia-Site-requests, 5Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1299386 (10coren) Creating a wiki and having it replicated in Labs may be thought of as an independent step that can be done at any poin... [17:16:58] 6operations, 10Wikimedia-DNS, 10Wikimedia-Site-requests, 5Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1299400 (10Krenair) Yes, we should be replicating all WMF publicly-visible wikis to labs, @coren. 
[17:20:25] 6operations, 10ops-eqiad: ssh connection to some management servers fails, a hard reset may be needed - https://phabricator.wikimedia.org/T99805#1299408 (10jcrespo) 3NEW [17:21:05] 6operations, 10ops-eqiad: ssh connection to some management servers fails, a hard reset may be needed - https://phabricator.wikimedia.org/T99805#1299418 (10jcrespo) [17:21:07] 6operations, 5Patch-For-Review, 7database: On a maintenance window, upgrade db1063 to 14.04 and its MariaDB package to 10.0.16 - https://phabricator.wikimedia.org/T99520#1299417 (10jcrespo) [17:21:57] 6operations, 10Continuous-Integration-Infrastructure, 6Release-Engineering, 5Patch-For-Review: Jenkins: Re-enable lint checks for Apache config in operations-puppet - https://phabricator.wikimedia.org/T72068#1299420 (10faidon) [17:23:36] (03CR) 10Filippo Giunchedi: "also given that this is our 'eyes' to varnish, tests would be nice to have" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [17:26:22] _joe_, twentyafterfour I see a relative flat line, maybe less activity at 3AM UTC, but consistent load at most times [17:26:51] cmjohnson1: there? [17:28:31] 6operations, 10Wikimedia-DNS, 10Wikimedia-Site-requests, 5Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1299435 (10Krenair) @coren: https://wikitech.wikimedia.org/w/index.php?title=Add_a_wiki&diff=160100&oldid=158643 [17:28:38] johnslewis: yep [17:28:41] what's up? [17:29:07] 6operations, 6Phabricator, 7database: Phabricator database access for Joel Aufrecht - https://phabricator.wikimedia.org/T99295#1299436 (10csteipp) @JAufrecht can you describe the queries you'll be running? I'd like to not reveal contents or counts of security bugs. NDA group is probably the same, although t... [17:29:45] cmjohnson1: mind giving analytics1028 an ack. in icinga if its down because of the RAID thing yesterday? 
:) [17:29:56] yep....taking care of that now [17:30:45] thanks [17:32:16] ACKNOWLEDGEMENT - Host analytics1028 is DOWN: PING CRITICAL - Packet loss = 100% Chris Johnson this server experienced a fried raid card and needs a new system board and raid card. [17:32:54] <^demon|zzz> qchris: today I learned that individual repos can override the max git object size setting. That's batshit. [17:33:14] ^demon|zzz: git doing something batshit? QUELLE SURPRISE! [17:33:14] ^demon|zzz: You're having bad dreams :-) [17:33:30] But yes, they can. [17:34:06] Any need to lock it down for us? [17:35:06] <^demon|zzz> yuvipanda: Not git, gerrit [17:35:15] ah that's what I meant [17:35:17] git is quite nice [17:35:21] <^demon|zzz> qchris: No, probably not. It just seems like something stupid :p [17:35:27] * ^demon|zzz blames the usual suspects [17:35:27] gerrit-annex [17:35:38] (03CR) 10Aklapper: [C: 031] phabricator: Add priority keywords/labels for !priority email command [puppet] - 10https://gerrit.wikimedia.org/r/209445 (https://phabricator.wikimedia.org/T98356) (owner: 10Merlijn van Deen) [17:36:52] ^demon|zzz: "The project specific setting in `project.config` is only honored when it further reduces the global limit." [17:37:06] <^demon|zzz> Ahhhh, ok so not so crazy. [17:38:35] 6operations, 6Phabricator, 7database: Phabricator database access for Joel Aufrecht - https://phabricator.wikimedia.org/T99295#1299463 (10JAufrecht) There is sample data in [[ https://docs.google.com/spreadsheets/d/1oy83WsZLFFlBY4HIazXZrVv1UKrCWfoPKCoFii9LIYw/edit#gid=555608897| this spreadsheet ]]. For... [17:41:24] Hi [17:41:34] !log esams+eqiad upload varnish caches will be downtimed+rebooted today, experimenting with depool effects as well (next several hours) [17:41:37] getting some rather slow performance and non page loads [17:41:40] Logged the message, Master [17:41:43] quick ?: what level of error_reporting do we use for normal PHP from web requests on the cluster? [17:41:51] Qcoder00: details?
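The rule qchris quotes about the per-project object size limit (honored only when it further reduces the global limit) can be modelled in a few lines. This is a sketch of the quoted sentence only, not of Gerrit's implementation: effective_limit is a hypothetical name, and using None to mean "no limit configured" is an assumption of this example rather than how Gerrit represents it.

```python
def effective_limit(global_limit, project_limit):
    """Model the quoted rule: a project may only tighten the global limit.

    None stands for "no limit configured" (an assumption of this sketch).
    """
    if project_limit is None:
        return global_limit
    if global_limit is None:
        return project_limit
    # A larger project value is ignored; a smaller one wins.
    return min(global_limit, project_limit)


print(effective_limit(100, 50))    # project tightens the limit
print(effective_limit(100, 200))   # project cannot loosen it
print(effective_limit(100, None))  # no project override
```

Read this way, a repository owner cannot raise the ceiling set by the server administrator, which is why the setting turned out to be "not so crazy".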
[17:42:07] https://en.wikipedia.org - Nothing happens [17:42:18] Attempts loading for a long time and then bails [17:42:28] with a "Problem Loading Page" in Firefox [17:42:52] Timeouts seemingly [17:43:03] And yet weirdly wikisource.org is OK [17:44:23] 6operations: dataset1001: add new disk array - https://phabricator.wikimedia.org/T99808#1299464 (10RobH) 3NEW a:3Cmjohnson [17:44:34] 6operations, 10ops-eqiad: dataset1001: add new disk array - https://phabricator.wikimedia.org/T99808#1299474 (10RobH) [17:44:37] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0] [17:44:47] 6operations, 10hardware-requests: order new array for dataset1001 - https://phabricator.wikimedia.org/T93118#1129802 (10RobH) [17:44:49] 6operations, 10ops-eqiad: dataset1001: add new disk array - https://phabricator.wikimedia.org/T99808#1299464 (10RobH) [17:45:06] Qcoder00: what platform, operating system, type of network, geographical location? [17:45:08] OK Wikisource now time-out [17:45:24] Windows XP, Firefox, UK network [17:45:30] and my ISP is THUS [17:45:37] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [17:45:59] Qcoder00: the rest of your network connectivity is OK? [17:46:04] Seemingly [17:46:10] In that I can access bbc.co.uk [17:46:35] Do you know of a .nl site in English i can use as a test link? [17:46:38] 6operations, 10ops-eqiad: dataset1001: add new disk array - https://phabricator.wikimedia.org/T99808#1299464 (10RobH) [17:46:40] 6operations, 10hardware-requests: order new array for dataset1001 - https://phabricator.wikimedia.org/T93118#1299483 (10RobH) 5Open>3Resolved This has been ordered and will arrive today/tomorrow. T99808 is for the installation, and receipt is tracked in RT https://rt.wikimedia.org/Ticket/Display.html?id=9... 
[17:47:03] Qcoder00: hmmm uh no [17:48:24] Qcoder00: http://www.evoswitch.com/ [17:48:26] I can see - http://translate.google.co.uk/translate?hl=en&sl=nl&tl=en&u=http%3A%2F%2Fnos.nl%2F&anno=2 [17:48:50] mutante: Comes up very quickly... [17:48:56] but I can't see Wikipedia [17:49:28] Qcoder00: could you run traceroute and paste the results? [17:51:24] Sorry for flood [17:51:25] 05/20/15 18:49:57 Fast traceroute en.wikipedia.org [17:51:27] Trace en.wikipedia.org (91.198.174.192) ... [17:51:28] 1 #REDACTED# 4ms 4ms 4ms TTL: 0 (No rDNS) [17:51:29] Qcoder00: any difference whether you use http or https ? [17:51:30] 2 #REDACTED# 16ms 16ms 20ms TTL: 0 (No rDNS) [17:51:31] 3 194.217.23.21 16ms 17ms 20ms TTL: 0 (anchor-access-4-s2006.router.demon.net ok) [17:51:33] 4 194.159.161.78 16ms 16ms 16ms TTL: 0 (gi4-0-0-dar4.lah.uk.cw.net ok) [17:51:34] 5 193.195.25.34 16ms 16ms 20ms TTL: 0 (xe-11-1-0-xur1.lns.uk.cw.net ok) [17:51:36] 6 195.66.236.175 16ms 20ms 16ms TTL: 0 (linx-2.init7.net probable bogus rDNS: No DNS) [17:51:38] 7 82.197.168.42 23ms 26ms 17ms TTL: 0 (r1lon2.core.init7.net fraudulent rDNS) [17:51:39] 8 77.109.128.34 66ms 24ms 28ms TTL: 0 (r1ams2.core.init7.net fraudulent rDNS) [17:51:41] 9 77.109.134.114 23ms 23ms 24ms TTL: 0 (gw-wikimedia.init7.net probable bogus rDNS: No DNS) [17:51:42] 10 91.198.174.192 23ms 27ms 23ms TTL: 54 (text-lb.esams.wikimedia.org ok) [17:51:46] http works [17:51:49] https doesn't [17:52:02] Qcoder00: that would be because Windows XP then most likely [17:52:18] It was working earlier today [17:52:22] Qcoder00: any other browsers to compare? [17:52:25] hmm [17:52:50] no error message at all? [17:53:20] Loads in chrome [17:53:29] Qcoder00: doesn't in IE? [17:53:37] I think he said FF [17:53:49] loads in chrome doesn't load in FF => sounds like IPv6 issue [17:54:06] I don't as far as I know use IPv6 [17:54:22] can you double-check? [17:54:30] How do I do that?
[17:54:39] try running "ipconfig /all" [17:55:50] OK I'm seeing a mixture of Ip4 and IP6 addresses [17:56:26] do you see any IPv6 addresses that do not start with fe80? [17:57:25] they all start with fec0 [17:57:40] ok, we can ignore these as well [17:57:41] in respect of DNS [17:58:00] And no I don't see any others that start with anything else [17:58:01] (03PS1) 10BBlack: depool cp1050 in puppet (testing) [puppet] - 10https://gerrit.wikimedia.org/r/212309 [17:58:27] (03CR) 10BBlack: [C: 032 V: 032] depool cp1050 in puppet (testing) [puppet] - 10https://gerrit.wikimedia.org/r/212309 (owner: 10BBlack) [17:58:46] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1299524 (10MoritzMuehlenhoff) As mentioned on IRC yesterday I don't have a strong preference either way; after all all of the information (except user profile settings) was exposed vi... [17:59:00] I'm still puzzled... [17:59:13] 7Blocked-on-Operations, 6Collaboration-Team, 10Echo, 6Scrum-of-Scrums, 7Schema-change: Perform schema change to echo_target_page changing from a 1 to 1 mapping between pages and user/notification to a 1 to many. - https://phabricator.wikimedia.org/T94427#1299530 (10Mattflaschen) Friendly bump. This is b... [17:59:16] But it's looking less like a Wikimedia issue this end [18:00:01] 7Blocked-on-Operations, 6operations, 10Maps, 6Scrum-of-Scrums, 10hardware-requests: Eqiad Spare allocation: 1 hardware access request for OSM Maps project - https://phabricator.wikimedia.org/T97638#1299534 (10RobH) @Yurik, I still don't have the information on what exactly you plan to do with the system... [18:00:04] twentyafterfour, greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150520T1800). Please do the needful. 
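The address check in the exchange above (look at the "ipconfig /all" output and ignore anything that starts with fe80 or fec0) can be sketched in Python. This is only an illustration, not something that was run in the channel: the function name routable_ipv6 and the sample addresses are made up, and it assumes the addresses have already been copied out of the ipconfig output as plain strings.

```python
import ipaddress


def routable_ipv6(addresses):
    """Return the IPv6 addresses that are neither link-local (fe80::/10)
    nor deprecated site-local (fec0::/10).

    Hypothetical helper for illustration; IPv4 entries are skipped.
    """
    result = []
    for text in addresses:
        # Windows appends a zone index such as "%11"; drop it before parsing.
        addr = ipaddress.ip_address(text.split("%")[0])
        if addr.version != 6:
            continue
        if addr.is_link_local or addr.is_site_local:
            continue
        result.append(str(addr))
    return result


# Made-up sample resembling a mixed IPv4/IPv6 ipconfig listing.
sample = [
    "192.168.1.10",
    "fe80::1c2d:3e4f:5a6b%11",
    "fec0:0:0:ffff::1",
    "2001:470:1f09::2",
]
print(routable_ipv6(sample))
```

An empty result would match what Qcoder00 reported: only fe80/fec0 addresses, so no usable IPv6 connectivity to blame.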
[18:00:58] 7Blocked-on-Operations, 6Collaboration-Team, 10Echo, 6Scrum-of-Scrums, 7Schema-change: Perform schema change to echo_target_page changing from a 1 to 1 mapping between pages and user/notification to a 1 to many. - https://phabricator.wikimedia.org/T94427#1299536 (10Mattflaschen) [18:03:27] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [18:10:50] OK Tested in IE and Chrome [18:10:52] Loads [18:10:59] but not in Firefox [18:13:25] (03CR) 1020after4: [C: 031] [WIP] Add Vagrantfile [puppet] - 10https://gerrit.wikimedia.org/r/212294 (owner: 10BryanDavis) [18:14:25] 6operations, 10Wikimedia-Mailing-lists: close and delete the flowfunding mailing list - https://phabricator.wikimedia.org/T97328#1299549 (10RobH) 5Open>3Resolved The archives are public, so I've followed the directions on https://wikitech.wikimedia.org/wiki/Lists.wikimedia.org#Disable_a_mailing_list which... [18:15:10] robh: finishing the tasks you didn't do yesterday I see :p [18:15:26] yep =] [18:15:38] by the end of yesterday i didnt wanna touch mailman again [18:15:40] heh [18:15:43] making a ticket for the issue described by Qcoder00 [18:20:33] 6operations, 10Wikimedia-Mailing-lists: close and delete the flowfunding mailing list - https://phabricator.wikimedia.org/T97328#1299566 (10RobH) [18:20:37] 6operations, 10Wikimedia-Mailing-lists: Rename Wikidata-l to Wikidata - https://phabricator.wikimedia.org/T99136#1299567 (10RobH) [18:26:42] never mind, talked more in PM , sounds like personal browser issue now [18:27:19] thanks for the merge mutante [18:27:45] matanya: thanks for the patch [18:27:53] :) [18:29:22] (03PS1) 10BBlack: Revert "depool cp1050 in puppet (testing)" [puppet] - 10https://gerrit.wikimedia.org/r/212313 [18:29:36] (03CR) 10BBlack: [C: 032 V: 032] Revert "depool cp1050 in puppet (testing)" [puppet] - 10https://gerrit.wikimedia.org/r/212313 (owner: 10BBlack) [18:30:07] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 
1.00% above the threshold [250.0] [18:40:24] 6operations, 7Epic, 10Wikimedia-Mailing-lists: Rename all mailing lists with -l suffixes to get rid of that suffix - https://phabricator.wikimedia.org/T99138#1299616 (10RobH) a:5RobH>3None I'm not sure it is worth doing to a list, unless that list requests it. We don't make them going forward with the -... [18:45:47] PROBLEM - Host analytics1030 is DOWN: PING CRITICAL - Packet loss = 100% [18:46:07] ^ that you, ottomata? [18:47:44] mutante: is it by design that lists.wikimedia.org doesn't support tls 1.2 and PFS ? [18:48:59] 6operations, 7Epic, 10Wikimedia-Mailing-lists: Rename all mailing lists with -l suffixes to get rid of that suffix - https://phabricator.wikimedia.org/T99138#1299646 (10Philippe-WMF) I see no real benefit to this, and the risk of disruption. The consistency argument is not compelling, to my mind. [18:49:22] 6operations, 10Traffic: Reboot caches for kernel 3.19.6 globally - https://phabricator.wikimedia.org/T96854#1299647 (10BBlack) I did some testing to re-confirm upload cache behavior on reboots (while getting cp1048-cp1050 rebooted in the process): A reboot without any depooling or only depooled in pybal resul... [18:49:27] RECOVERY - Host analytics1030 is UPING OK - Packet loss = 0%, RTA = 2.36 ms [18:49:30] 6operations, 7Epic, 10Wikimedia-Mailing-lists: Rename all mailing lists with -l suffixes to get rid of that suffix - https://phabricator.wikimedia.org/T99138#1299648 (10JohnLewis) 5Open>3declined a:3JohnLewis [18:52:36] matanya: it's because the distro version is old [18:52:44] I see [18:54:34] bblack, do you know if response_size that varnish sends is compressed or uncompressed? [18:55:45] that it sends where? [18:56:24] maybe a better question is: what is response_size in this context and where are you seeing it?
[18:56:47] yurik: ^ [18:57:07] (03CR) 10Ori.livneh: "@bd808: You can declare this configuration option for just the canary app / api servers in hieradata/role/common/mediawiki/appserver/canar" [puppet] - 10https://gerrit.wikimedia.org/r/211155 (https://phabricator.wikimedia.org/T98489) (owner: 10BryanDavis) [19:02:35] (03PS4) 10Yuvipanda: Tools: Puppetize database aliases as host resources [puppet] - 10https://gerrit.wikimedia.org/r/210000 (https://phabricator.wikimedia.org/T63897) (owner: 10Tim Landscheidt) [19:03:28] 7Blocked-on-Operations, 6operations, 10Maps, 6Scrum-of-Scrums, 10hardware-requests: Eqiad Spare allocation: 1 hardware access request for OSM Maps project - https://phabricator.wikimedia.org/T97638#1299694 (10MaxSem) >>! In T97638#1299534, @RobH wrote: > * Testing implementations of the database, does th... [19:03:30] JohnLewis: do you have some info about the hardware usage of mailman? e.g cpu/ram/network/HDD ? [19:04:04] matanya: now that depends on the usage and why [19:05:50] is your labs mailman an exact copy of prod? I'm asking cause i want to know how heavy mailman is in terms of hardware, which in turn will illuminate me about https://phabricator.wikimedia.org/T82698#1285294 [19:06:07] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [19:06:12] 7Blocked-on-Operations, 6operations, 10Maps, 6Scrum-of-Scrums, 10hardware-requests: Eqiad Spare allocation: 1 hardware access request for OSM Maps project - https://phabricator.wikimedia.org/T97638#1299696 (10RobH) On the storage: If it was SSDs, are the SSDs needed for OS and data, or only for the data... [19:06:53] 7Blocked-on-Operations, 6operations, 10Maps, 6Scrum-of-Scrums, 10hardware-requests: Eqiad Spare allocation: 1 hardware access request for OSM Maps project - https://phabricator.wikimedia.org/T97638#1299699 (10RobH) On the cpu/ram: I have multiple options on the spares page, including more memory and cpu...
[19:07:54] (03PS2) 10BryanDavis: Set HHVM mysql connection timeout to 3s on canary servers [puppet] - 10https://gerrit.wikimedia.org/r/211155 (https://phabricator.wikimedia.org/T98489) [19:12:20] (03CR) 10Yuvipanda: "Puppet is surprisingly smart about managing /etc/hosts files - https://phabricator.wikimedia.org/P663 is the diff on applying this patch. " [puppet] - 10https://gerrit.wikimedia.org/r/210000 (https://phabricator.wikimedia.org/T63897) (owner: 10Tim Landscheidt) [19:14:07] PROBLEM - Host analytics1031 is DOWN: PING CRITICAL - Packet loss = 100% [19:14:46] (03CR) 1020after4: [C: 032] Remove 1.26wmf1 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212319 (owner: 1020after4) [19:14:54] (03PS1) 10EBernhardson: Only return mostly fresh data for elasticsearch ganglia monitoring [puppet] - 10https://gerrit.wikimedia.org/r/212322 [19:15:31] (03Merged) 10jenkins-bot: Remove 1.26wmf1 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212319 (owner: 1020after4) [19:16:17] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [19:17:48] (03PS2) 10EBernhardson: Only return mostly fresh data for elasticsearch ganglia monitoring [puppet] - 10https://gerrit.wikimedia.org/r/212322 [19:17:55] matanya: labs is a dupe in terms of functionality. [19:18:21] labs runs Debian Jessie while prod runs Ubuntu Lucid. labs also has usage [19:18:55] 6operations: Upgrade sodium to jessie - https://phabricator.wikimedia.org/T82698#1299722 (10JohnLewis) >>! In T82698#1285294, @RobH wrote: > Do we have any specific requirements for a new system for this, other than 'similar to sodium'? > > This seems to have stalled out, due to other items taking precedence, b... 
[19:19:11] robh ^ a comment for the hw allocation side of you at least [19:20:02] cool, thx [19:21:13] have fun poking for an update off the two you need to though :p [19:21:34] bblack, sorry, what i meant was - when we store response_size (from %b varnishncsa) in hadoop logs, is it compressed or not? [19:22:31] hey godog, still around? [19:23:42] yurik: in general, we don't compress much. I think it's only turned on for... ico files and svg files? I'll have to look at what varnishncsa gets response_size from and all that, but I'd assume it's the wire response size (so, affected by compression) [19:24:07] bblack, we don't compress HTML ?!? [19:24:07] the applayer backends could compress other things in theory, but I don't think they do. [19:24:56] matanya: but yeah, labs basically emulates mailman after its upgraded which robh said he'll work on getting some progress on so lucid is killed. all it needs is the actual usage. mail is handled and everything :) [19:25:19] (03CR) 10Yuvipanda: Tools: Puppetize database aliases as host resources (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/210000 (https://phabricator.wikimedia.org/T63897) (owner: 10Tim Landscheidt) [19:25:48] (03CR) 10Dzahn: "..or is it possible to actually have things in a module in the private repo ?" [puppet] - 10https://gerrit.wikimedia.org/r/198116 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [19:26:02] yurik: we do, just not inside of varnish [19:26:14] I think it's not happening inside of varnish, anyways [19:26:22] bblack, doesn't varnish uncompress stuff if the client doesn't support it? 
[19:26:37] yes [19:26:59] (03PS5) 10Yuvipanda: tools: Puppetize database aliases as host resources [puppet] - 10https://gerrit.wikimedia.org/r/210000 (https://phabricator.wikimedia.org/T63897) (owner: 10Tim Landscheidt) [19:27:05] that part's generally easy, it's the part where varnish does the compressing for the backend that's trickier and we only do for ico/svg [19:27:13] (currently) [19:30:40] when we are adding new appservers, is there a step like "initial sync of mediawiki" that people do, or is it really just adding to dsh group and the next normal sync does it [19:31:12] just dont want to surprise deployers [19:31:24] i'm noticing that the stats in ganglia for the es cluster are fairly off, for example running the python monitoring script manually gets ~79M for es_docs_count on es1004, but ganglia has been reporting a steady 20M for a few months now. is it ok to bounce gmond there (and perhaps other servers in the es cluster if that shows to help)? http://bit.do/4U6M [19:31:26] RECOVERY - Host analytics1031 is UPING OK - Packet loss = 0%, RTA = 1.05 ms [19:31:30] yurik: yeah varnishncsa %b should correlate to wire response size, minus headers (in other words, Content-Length, which varies with Content-Encoding such as gzip) [19:31:35] twentyafterfour / greg-g ^^ (mutante's question) [19:31:59] !log twentyafterfour Started scap: testwiki to php-1.26wmf7 and rebuild l10n cache [19:32:06] Logged the message, Master [19:32:13] bblack, ? so it is the real transmission size minus headers? 
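bblack's point above, that %b tracks the body size on the wire (Content-Length, which changes with Content-Encoding such as gzip), can be illustrated with a small Python sketch. It does not touch varnish or varnishncsa; the body_sizes helper and the sample HTML body are assumptions made up for the example, and it only shows why the logged size of the same page differs depending on whether the response went out gzipped.

```python
import gzip


def body_sizes(body: bytes) -> dict:
    """Return the body length as sent uncompressed ("identity") and as
    sent with gzip Content-Encoding. Hypothetical helper for illustration."""
    return {
        "identity": len(body),
        "gzip": len(gzip.compress(body)),
    }


# Made-up, highly repetitive HTML body; real pages compress less evenly.
page = b"<p>Hello, Wikipedia!</p>\n" * 200
sizes = body_sizes(page)
print(sizes["identity"], sizes["gzip"])
```

If the logged value is the wire size, an analysis of response_size in the hadoop logs would need to know which encoding each response used before comparing sizes across requests.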
[19:33:36] mutante: I'm not aware of any initial sync [19:33:57] twentyafterfour: so i would like to try and add "mira" , it's supposed to become the "tin of codfw" [19:34:12] twentyafterfour: but it's not like all the other appservers, so there might be issues on first run [19:34:23] the "tin of codfw" sounds like an awesome LCD tag ;) [19:34:24] nevertheless we want mw on it [19:35:03] (03CR) 10Tim Landscheidt: [C: 04-1] "I did test this on Toolsbeta, with "sql $lastdbinlongline", to make sure that it didn't hit some maximum line limit. I was going to remov" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/210000 (https://phabricator.wikimedia.org/T63897) (owner: 10Tim Landscheidt) [19:35:13] mutante: you want me to run an extra sync just to be sure? when will it be ready? [19:35:21] !log twentyafterfour scap failed: CalledProcessError Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki="testwiki" --outdir="/tmp/scap_l10n_3448528422" --threads=4 --lang en --quiet' returned non-zero exit status 255 (duration: 03m 22s) [19:35:26] Logged the message, Master [19:36:01] twentyafterfour: if you can sync it to just mira manually that would be nice i think [19:36:27] twentyafterfour: i didnt put it into pybal... [19:36:45] yurik: it's what I said it was :) [19:37:06] yurik: "real transmission size" means... what at what layer? there are many protocol layers here heh [19:37:17] PROBLEM - Host analytics1032 is DOWN: PING CRITICAL - Packet loss = 100% [19:37:26] JohnLewis: R320 [19:37:37] yurik: it's content-length, essentially [19:37:54] mutante: nope :'( [19:38:04] if it did, we need it set to 'tin of codfw' [19:38:23] JohnLewis: but not role::tin-of-codfw please :p [19:38:33] (03CR) 10Yuvipanda: tools: Puppetize database aliases as host resources (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/210000 (https://phabricator.wikimedia.org/T63897) (owner: 10Tim Landscheidt) [19:38:43] that'll come later! 
[19:40:30] (03CR) 10Yuvipanda: "(I'm confused about the -1)" [puppet] - 10https://gerrit.wikimedia.org/r/210000 (https://phabricator.wikimedia.org/T63897) (owner: 10Tim Landscheidt) [19:40:43] i wonder if one of the 2 things dsh group and applying puppet role needs to go first [19:41:08] we can set cluster to "misc" already [19:41:35] (03CR) 10Dzahn: [C: 032] mira, codfw deploy server, set $cluster [puppet] - 10https://gerrit.wikimedia.org/r/210836 (owner: 10Dzahn) [19:42:13] mutante: I'd apply the role first [19:42:20] run puppet and see how that goes [19:43:43] (03CR) 10Tim Landscheidt: "(-1 because of the duplicate "tools.labsdb".)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/210000 (https://phabricator.wikimedia.org/T63897) (owner: 10Tim Landscheidt) [19:44:14] (03CR) 10EBernhardson: "While this doesn't fix the issue, i'm pretty sure this async stuff is the source of our problems. Bouncing gmond on elastic1004 immediatl" [puppet] - 10https://gerrit.wikimedia.org/r/212322 (owner: 10EBernhardson) [19:44:35] JohnLewis: right, it does a lot of things [19:44:47] RECOVERY - Host analytics1032 is UPING OK - Packet loss = 0%, RTA = 4.37 ms [19:44:49] as long as that cant influence normal deployment [19:45:04] (03CR) 10Yuvipanda: "Where's the duplicate?" [puppet] - 10https://gerrit.wikimedia.org/r/210000 (https://phabricator.wikimedia.org/T63897) (owner: 10Tim Landscheidt) [19:45:14] and no more hardcoded tin [19:45:48] mutante: it won't influence deploymenty [19:45:58] 6operations, 10Traffic: Reboot caches for kernel 3.19.6 globally - https://phabricator.wikimedia.org/T96854#1299793 (10BBlack) Did some further testing and investigating on the above. I'm starting to think this isn't inherent to a difference in upload's traffic characteristic, and is instead all about the con... [19:47:11] when will it actually be possible to deploy from mira instead of tin? 
[19:47:23] (03CR) 10Dzahn: "duplicate of https://gerrit.wikimedia.org/r/#/c/209874/4" [puppet] - 10https://gerrit.wikimedia.org/r/210837 (https://phabricator.wikimedia.org/T95436) (owner: 10Dzahn) [19:47:37] (03Abandoned) 10Dzahn: add role::deployment::server to mira.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/210837 (https://phabricator.wikimedia.org/T95436) (owner: 10Dzahn) [19:47:58] (03PS6) 10Yuvipanda: tools: Puppetize database aliases as host resources [puppet] - 10https://gerrit.wikimedia.org/r/210000 (https://phabricator.wikimedia.org/T63897) (owner: 10Tim Landscheidt) [19:48:47] Krenair: it depends how many issues we run into after applying the role / how good the existing puppetiziation of tin was [19:48:55] about to find out more [19:49:36] (03CR) 10Yuvipanda: tools: Puppetize database aliases as host resources (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/210000 (https://phabricator.wikimedia.org/T63897) (owner: 10Tim Landscheidt) [19:51:56] (03CR) 10Tim Landscheidt: [C: 031] tools: Puppetize database aliases as host resources [puppet] - 10https://gerrit.wikimedia.org/r/210000 (https://phabricator.wikimedia.org/T63897) (owner: 10Tim Landscheidt) [19:52:17] (03PS7) 10Yuvipanda: tools: Puppetize database aliases as host resources [puppet] - 10https://gerrit.wikimedia.org/r/210000 (https://phabricator.wikimedia.org/T63897) (owner: 10Tim Landscheidt) [19:52:25] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Puppetize database aliases as host resources [puppet] - 10https://gerrit.wikimedia.org/r/210000 (https://phabricator.wikimedia.org/T63897) (owner: 10Tim Landscheidt) [19:53:29] (03PS5) 10Dzahn: Add deployment server role to mira [puppet] - 10https://gerrit.wikimedia.org/r/209874 (https://phabricator.wikimedia.org/T95436) (owner: 10John F. 
Lewis) [19:55:08] 7Blocked-on-Operations, 6operations, 10Maps, 6Scrum-of-Scrums, 10hardware-requests: Eqiad Spare allocation: 1 hardware access request for OSM Maps project - https://phabricator.wikimedia.org/T97638#1299819 (10Yurik) @Robh, We already had a meeting with @mark and a few more people from Ops. I think Mark h... [19:55:36] (03CR) 10Dzahn: [C: 032] Add deployment server role to mira [puppet] - 10https://gerrit.wikimedia.org/r/209874 (https://phabricator.wikimedia.org/T95436) (owner: 10John F. Lewis) [19:56:05] robh, ^^ [19:57:16] PROBLEM - Host analytics1033 is DOWN: PING CRITICAL - Packet loss = 100% [19:57:57] ottomata: known? [19:58:17] looks on mgmt [19:58:45] yes known [19:58:50] alright [19:58:51] hm, weird that it is alerting for this [19:58:54] disconnects [19:59:00] i scheduled these for downtimes i though [19:59:02] thought [19:59:04] the others didn't alert [19:59:17] i'm in mgmt now [19:59:21] ottomata: you remind me, so i added a script to schedule downtimes much easier [19:59:26] ohhh? [19:59:28] you did?! [19:59:29] how? [19:59:31] tell me the goodness [20:00:04] analtyics1030 alerted [20:00:05] gwicke, cscott, arlolra, subbu: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150520T2000). [20:00:30] hm [20:00:37] RECOVERY - Host analytics1033 is UPING OK - Packet loss = 0%, RTA = 2.65 ms [20:00:39] (not sms, just irc) [20:00:47] PROBLEM - puppet last run on sca1001 is CRITICAL Puppet has 51 failures [20:00:52] ottomata: [20:00:54] [neon:~] $ sudo icinga-downtime [20:00:54] usage: /usr/local/bin/icinga-downtime -h -d -r [20:01:10] amaazing :) [20:01:12] hostname = as listed in icinga web ui, duration = seconds [20:02:26] great! [20:02:27] it works! 
[20:02:35] :) user will be listed as "marvin-bot" [20:02:44] because it's a _down_time bot :p [20:05:44] Krenair: i see it installing tons of mw related packages now btw [20:07:04] PROBLEM - Keyholder SSH agent on mira is CRITICAL: NRPE: Command check_keyholder not defined [20:07:51] (03CR) 10Yuvipanda: "All done :) Thanks for the patch!" [puppet] - 10https://gerrit.wikimedia.org/r/210000 (https://phabricator.wikimedia.org/T63897) (owner: 10Tim Landscheidt) [20:08:03] PROBLEM - Translation cache space on mira is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [20:09:12] ah,, icinga [20:09:13] mutante, so the keyholder was not properly puppetised? [20:09:16] Wrapped exception: [20:09:16] No such file or directory - /srv/mediawiki/private/WikitechPrivateLdapSettings.php20150520-26392-1j51isb.lock [20:09:26] umm [20:09:54] so it probably needs to be: apply puppet role and ignore errors, use sync scripts, run puppet again? [20:10:25] actually it's still not done with the run [20:11:58] !log deployed parsoid version 8ed6fd0b [20:12:07] Logged the message, Master [20:12:36] (03CR) 10Tim Landscheidt: "It is only intended to make the workaround puppetier :-). T63897 stills remains open." [puppet] - 10https://gerrit.wikimedia.org/r/210000 (https://phabricator.wikimedia.org/T63897) (owner: 10Tim Landscheidt) [20:15:48] (03CR) 10Yuvipanda: "Yup. Need to wait for split horizon, move to designate and then move it around, I guess." [puppet] - 10https://gerrit.wikimedia.org/r/210000 (https://phabricator.wikimedia.org/T63897) (owner: 10Tim Landscheidt) [20:16:54] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: NRPE: Command check_mediawiki_config_merged not defined [20:17:14] turns off the mira notifications [20:17:20] aude, paravoid: no idea regarding the Location: http://www.wikidata.org/wiki/Main_Page problem. is there a phabricator ticket about it or was it solved? [20:17:29] Krenair: it kind of just froze there .. 
grmbl [20:18:14] and i had to complain because now it continues minutes later.. [20:18:59] Notice: Finished catalog run in 1324.90 seconds [20:19:09] JohnLewis: ^ performance [20:19:26] (03PS1) 10John F. Lewis: mira: add ssh::hostkeys-collect as role [puppet] - 10https://gerrit.wikimedia.org/r/212414 [20:19:27] wow [20:19:31] it's likely the most complex role to apply to anything [20:19:33] on first run [20:19:50] yeah [20:19:54] RECOVERY - Translation cache space on mira is OK: HHVM_TC_SPACE OK TC sizes are OK [20:20:00] but hey, it did not actually break, just a little bit :) [20:20:12] tc cache is happy so :) [20:20:15] (03CR) 10Manybubbles: "The trouble is that the current version of Elasticsearch that we're using sometimes takes a long time to return data. That is why I had it" [puppet] - 10https://gerrit.wikimedia.org/r/212322 (owner: 10EBernhardson) [20:20:17] (03CR) 10Manybubbles: [C: 031] "The trouble is that the current version of Elasticsearch that we're using sometimes takes a long time to return data. That is why I had it" [puppet] - 10https://gerrit.wikimedia.org/r/212322 (owner: 10EBernhardson) [20:20:26] (03PS9) 10Paladox: Rename $wmincClosedWikis to $wgwmincClosedWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207909 [20:20:56] (03CR) 10Dzahn: "13:19 < mutante> Notice: Finished catalog run in 1324.90 seconds" [puppet] - 10https://gerrit.wikimedia.org/r/209874 (https://phabricator.wikimedia.org/T95436) (owner: 10John F. 
Lewis) [20:21:58] mutante: the icinga round up is: keyholder is not armed and nrpe has one 'unable to read output' error so all seems somewhat good :) [20:22:24] JohnLewis: pretty good given the expectations of trouble [20:22:32] definitely [20:22:53] PROBLEM - puppet last run on mira is CRITICAL Puppet has 7 failures [20:24:02] using the script to schedule downtime on mira too [20:24:14] (03PS12) 10Ottomata: Add varnish request stats diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) [20:24:16] lets see the second puppet run now [20:24:43] !log twentyafterfour Started scap: retry: testwiki to php-1.26wmf7 and rebuild l10n cache [20:24:43] (03CR) 10Ottomata: "Good point, done." [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [20:24:50] Logged the message, Master [20:24:56] (03CR) 10jenkins-bot: [V: 04-1] Add varnish request stats diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [20:26:10] (03PS1) 10Ricordisamoa: Add task ids for e41f9ab31a44a68a5979e38b5160c01f58135e49 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212436 [20:32:21] 6operations, 7Monitoring, 5Patch-For-Review: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1299960 (10Ottomata) Question: How should invalid statuses or methods be counted? I have seen very rare bad data printed out by varnishncsa for %m and %s. If any of %m or %s look bad, e.g. %m is not o... [20:33:19] No such file or directory - /home/l10nupdate/.ssh/id_rsa.pub20150520-21671-13c9lva.lock [20:33:22] hmm [20:33:45] 6operations, 7Monitoring, 5Patch-For-Review: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1299964 (10Ottomata) Also, please opine as to your favorite metric name hierarchy now. 
As this is diamond, I think the default prefix will be servers.hostname, so: ``` servers.cp1052.eqiad.varnish.text... [20:35:16] (03CR) 10Dzahn: [C: 032] mira: add ssh::hostkeys-collect as role [puppet] - 10https://gerrit.wikimedia.org/r/212414 (owner: 10John F. Lewis) [20:37:12] ACKNOWLEDGEMENT - Keyholder SSH agent on mira is CRITICAL Keyholder is not armed. Run keyholder arm to arm it. daniel_zahn still being set up [20:37:12] ACKNOWLEDGEMENT - puppet last run on mira is CRITICAL Puppet has 4 failures daniel_zahn still being set up [20:38:26] JohnLewis: it's collecting all the hostkeys, i gotta continue later, have an RL thing coming up [20:42:08] (03PS10) 10Paladox: Rename $wmincClosedWikis to $wgwmincClosedWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207909 [20:42:16] (03PS11) 10Paladox: Rename $wmincClosedWikis to $wgWmincClosedWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207909 [20:42:17] !log restarted gmond on elastic10{01..31}.eqiad.wmnet [20:42:22] Logged the message, Master [20:43:27] (03PS12) 10Paladox: Rename $wmincClosedWikis to $wgWmincClosedWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207909 [20:45:26] <_joe_> ebernhardson: why are you restarting gmond? 
[20:46:42] _joe_: its been reporting the wrong values for elasticsearch data for a few months, aggregate es_docs_count went from the 600M its been at for months to 2.2B (where it should be) right after restart [20:46:46] _joe_: http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&c=Elasticsearch+cluster+eqiad&h=&tab=m&vn=&hide-hf=false&m=es_docs_count&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name [20:46:54] _joe_: intially tested by manually running the es data collecting script gmond uses [20:47:05] <_joe_> ok makes sense :) [20:47:43] <_joe_> ebernhardson: I am asking as if the issue was "ganglia graphs blank" it's rarely gmond's fault [20:47:45] i'm a bit surprised that you have privs to do that [20:47:50] not that i'm opposed [20:48:15] well, was recently granted root only in the elastic cluster, it seemed sane and i checked around a few things first to make sure it would work as expected [20:48:28] <_joe_> jgage: we discussed this in the ops meeting [20:49:15] yeah, fine with me. this is part of a larger discussion about granting different levels of access [20:49:33] 6operations, 6Phabricator, 7database: Phabricator database access for Joel Aufrecht - https://phabricator.wikimedia.org/T99295#1300008 (10csteipp) @JAufrecht, and I'm assuming you're going to make the results public? If so, please do exclude any private issues-- you can either exclude everything in #security... [20:49:35] <_joe_> ebernhardson: sorry, I just wanted to check if you were getting into ganglia debugging hell. 
It seems you are, but from a different angle than the usual one :P [20:50:08] _joe_: well, still have to figure out how it got stuck returning all the same data for months :) so yes will be in ganglia hell (but not ganglia directly, just the script that gets the es data and returns it) [20:50:17] that's a weird problem, but one i've seen before [20:50:20] (03PS1) 10Yuvipanda: labs: Set labs nameserver IP globally in $::nameservers [puppet] - 10https://gerrit.wikimedia.org/r/212444 [20:50:20] ganglia is mysterious [20:50:24] ^ if anyone wants to review [20:50:33] (03PS2) 10Yuvipanda: labs: Set labs nameserver IP globally in $::nameservers [puppet] - 10https://gerrit.wikimedia.org/r/212444 [20:50:46] !log twentyafterfour Finished scap: retry: testwiki to php-1.26wmf7 and rebuild l10n cache (duration: 26m 02s) [20:50:51] Logged the message, Master [20:50:58] <_joe_> ebernhardson: as I said, different angle :P [20:51:01] :) [20:51:12] <_joe_> "good luck with that" [20:51:14] globals are bad, I guess, but globals better than hardcoding an IP in 5 different places :) [20:51:16] mutante: did your server get sync'd ? [20:51:30] <_joe_> yuvipanda: who sid globals are bad? [20:51:40] me? :) [20:51:47] well, as a general rule, I guess. [20:52:15] depends on what you mean by globals, etc [20:52:37] my earlier sentence would've been better if I had just said 'better than hardcoding an IP in 5 different places' and made no mention of globals, I guess [20:53:03] _joe_: I'm also going to write a 'resolve' puppet function [20:53:05] (in wmflib) [20:54:17] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I like the idea, a couple of issues with the implementation." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/212444 (owner: 10Yuvipanda) [20:54:39] <_joe_> yuvipanda: with cache and honoring TTLs? 
[20:54:48] 6operations, 6Phabricator, 7database: Phabricator database access for Joel Aufrecht - https://phabricator.wikimedia.org/T99295#1300029 (10JAufrecht) I have no objections to working with a dump that is already sanitized. I will make results public, but they'll look like {F167401}, so pretty high-level compar... [20:54:52] <_joe_> yuvipanda: I guess I wrote one in the strongswan module [20:54:57] oh did you? [20:55:08] ipresolve :) [20:55:16] (03CR) 1020after4: [C: 032] Wikipedias to 1.26wmf6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212320 (owner: 1020after4) [20:55:22] (03Merged) 10jenkins-bot: Wikipedias to 1.26wmf6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212320 (owner: 1020after4) [20:55:30] <_joe_> https://github.com/wikimedia/operations-puppet/blob/production/modules/strongswan/lib/puppet/parser/functions/ipresolve.rb [20:55:48] <_joe_> jgage: can you believe I didn't remember at first? [20:55:54] yes ;) [20:56:10] <_joe_> I was about to tell yuvi "that would be a great idea. Oh wait, I did it" [20:56:32] starting ocg deploy [20:56:35] <_joe_> btw, the code looks horrible [20:56:40] btw when a lookup fails, the result i get is that section of the template is not output. i'm not sure if that's the behavior i want, it might be better to throw an error. [20:56:43] (03PS3) 10Yuvipanda: labs: Set labs nameserver IP globally in $::nameservers [puppet] - 10https://gerrit.wikimedia.org/r/212444 [20:56:48] :D [20:57:02] <_joe_> jgage: uhm you're right [20:57:09] <_joe_> I should look into that [20:57:13] yeah, I think throwing an error would be the right thing to do [20:57:19] why is it in the strongswan module and not in wmflib? [20:57:21] i'll open a task if you like [20:57:25] mutante: looks like mira got synced just fine [20:57:38] I also need it to accept a nameserver IP to use :) [20:57:38] <_joe_> jgage: better to do that [20:57:41] k [20:57:55] because labs puppetmaster runs in prod, points to prod DNS... 
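The behaviour jgage asks for above (throw an error when a lookup fails instead of silently emitting nothing) can be sketched as follows. This is Python rather than the Ruby of the real ipresolve.rb, and it is not a port of that function: resolve_or_fail, ResolveError and the cache fallback are assumptions made up for the illustration, and the lookup function is passed in so the sketch can be exercised without touching DNS.

```python
import socket


class ResolveError(Exception):
    """Raised when a name cannot be resolved and no cached answer exists."""


def resolve_or_fail(hostname, cache, lookup=socket.gethostbyname):
    """Resolve hostname, remembering the answer in cache (a plain dict).

    Hypothetical helper: if the lookup fails, fall back to a cached answer
    when there is one, otherwise raise instead of returning nothing.
    """
    try:
        address = lookup(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        if hostname in cache:
            return cache[hostname]
        raise ResolveError(f"cannot resolve {hostname}") from exc
    cache[hostname] = address
    return address


# Stand-in lookups so the example runs offline.
def fake_ok(name):
    return "10.0.0.1"


def fake_fail(name):
    raise OSError("simulated DNS failure")


cache = {}
print(resolve_or_fail("db.example", cache, fake_ok))    # fresh lookup
print(resolve_or_fail("db.example", cache, fake_fail))  # served from cache
```

Taking the lookup as a parameter is also where a specific nameserver could be plugged in, which is the other thing yuvipanda says he needs; how that would be done in the real function is left open here.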
[20:57:56] <_joe_> yuvipanda: so refactor and move it :P [20:57:58] yeah [20:58:12] better than writing my own :) [20:58:18] jgage: can you cc me on the task as well? [20:58:21] sure [20:58:36] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.26wmf6 [20:58:40] i'm eating lunch at the moment but i'll create it today [20:58:41] Logged the message, Master [20:58:58] <_joe_> jgage: ok, it's the only way we will remember [20:59:07] (03CR) 1020after4: [C: 032] Group0 to 1.26wmf7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212321 (owner: 1020after4) [20:59:13] <_joe_> and maybe yuvi can look at it [20:59:13] (03Merged) 10jenkins-bot: Group0 to 1.26wmf7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212321 (owner: 1020after4) [20:59:14] <_joe_> :D [20:59:16] _joe_: hmm, so if there's no cache entry and DNS fails you dont' do anything, and I guess it should fail [20:59:46] <_joe_> yuvipanda: I agree, wrote that on the rooftop of club quarters, mostly [20:59:50] I'm going to do DNS based redis failover in toollabs, and feel somewhat dirty about it [20:59:50] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.26wmf7 [20:59:55] Logged the message, Master [20:59:59] I wonder if I should use nutcracker instead [21:00:07] but this is arbitray number of connections... [21:00:11] from arbitrary code [21:00:21] <_joe_> yuvipanda: wait [21:00:29] <_joe_> what are you trying to do? 
[21:00:33] <_joe_> I mean really
[21:00:34] ah
[21:00:34] so right now
[21:00:38] tools-redis.eqiad.wmflabs
[21:00:46] is a redis instance that individual tool suse
[21:00:47] *tools use
[21:00:58] but - if the virt* host on which that instance is in dies
[21:00:59] !log updated OCG to version ca4f64852de5b1de782b292b50038fbd2dd84266
[21:01:04] Logged the message, Master
[21:01:06] I can't failover it, even though I have a slave
[21:01:16] because code is directly pointing to the DNS hostname 'tools-redis.eqiad.wmflabs'
[21:01:49] so my options are basically to 1. Switchover via DNS, 2. Setup nutcracker on all hosts, alias tools-redis.eqiad.wmflabs to localhost, and switchover via nutcracker
[21:02:07] (2) is more flexible but also more complex, has another moving piece...
[21:02:52] _joe_: redis and cron are the only unfailoverable services left
[21:02:55] (and NFS of course)
[21:03:05] (CR) Yuvipanda: labs: Set labs nameserver IP globally in $::nameservers (2 comments) [puppet] - https://gerrit.wikimedia.org/r/212444 (owner: Yuvipanda)
[21:03:32] Undefined variable: wgGeSHiSupportedLanguages in /srv/mediawiki/wmf-config/CommonSettings.php on line 561
[21:03:53] <_joe_> yuvipanda: uhm, I would've used nutcracker :)
[21:04:02] this is flooding logs now that I updated versions
[21:04:22] _joe_: uh, uh. That's still an option.
[21:04:35] _joe_: note that by 'DNS' I don't actually mean 'DNS' but /etc/hosts because we're lame like that :)
[21:04:36] <_joe_> also, I think you have a fundamental problem with redis HA
[21:04:49] this isn't HA, just A :)
[21:05:01] but what's the fundamental problem?
[21:05:16] <_joe_> which is, if you move to the slave and write there, you need to do a lot of reconfiguration to make it a new master
[21:05:22] _joe_: not really.
[21:05:26] <_joe_> so the supposed solution is redis sentinel
[21:05:29] _joe_: slaves are readonly by default
[21:05:34] <_joe_> which, EW
[21:05:40] so writes will just fail until it decides its the master
[21:05:55] redis master / slave switchover is already part of tools-webproxy switchover instructions, and it works fairly reliably
[21:06:20] <_joe_> well, in the context of toollabs, probably
[21:06:23] yeah
[21:06:27] <_joe_> in prod it's a nightmare
[21:06:29] <_joe_> :)
[21:06:32] it does have multi minutes switchovers
[21:06:39] but I'm aiming for 99.5% here :D
[21:06:44] so, it's ok, I think
[21:07:24] nutcracker + sentinel and maybe redis-cluster
[21:07:28] when we aim higher, maybe :)
[21:07:40] <_joe_> nah
[21:07:46] (PS4) Yuvipanda: labs: Set labs nameserver IP globally in $::nameservers [puppet] - https://gerrit.wikimedia.org/r/212444
[21:07:57] and it'll lose writes during the switchover period
[21:07:58] which is also fine
[21:08:16] <_joe_> yuvipanda: ok so expectations are set low enough. Good!
[21:08:25] Puppet, operations, Interdatacenter-IPsec: Puppet function: ipresolve: throw an error if lookup fails, refactor into wmflib - https://phabricator.wikimedia.org/T99833#1300042 (Gage) NEW
[21:08:33] <_joe_> it's important to set them like that
[21:08:41] should I just ignore that error or roll back? I don't think it's breaking anything but it's flooding the logs badly
[21:08:47] _joe_: it's toollabs, I don't think I've to try to set them low
[21:08:58] > Currently switching them isn't very... possible. Worst case, you would create a new instance called 'tools-redis', replicate from the slave to it, and then make it master, and have the slave replicate from it. Better solutions should be worked on...
[21:09:06] <_joe_> twentyafterfour: how badly?
[21:09:15] fatalmonitor shows nothing but that error
[21:09:18] pretty much
[21:09:26] <_joe_> twentyafterfour: I think you should rollback then.
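Annotation on the ipresolve thread above (task T99833): the behaviour being asked for is "fail loudly when a lookup fails, rather than silently emitting nothing". The following is only a rough Python sketch of that idea, not the Ruby function in the strongswan module. The names `resolve_or_raise`, `LookupFailed` and the dict-based `cache` are hypothetical, and falling back to a cached value before raising is an assumption drawn from the "no cache entry and DNS fails" remark.

```python
import socket


class LookupFailed(Exception):
    """Raised when a name cannot be resolved and no cached value exists."""


def resolve_or_raise(hostname, cache, resolver=socket.gethostbyname):
    """Return an IP for hostname, falling back to the cache, else raise.

    cache is a plain dict of hostname -> last known IP (hypothetical).
    resolver is injectable so the failure path can be exercised offline.
    """
    try:
        ip = resolver(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        if hostname in cache:
            # Lookup failed, but an earlier answer is still available.
            return cache[hostname]
        # No cached answer either: surface the failure to the caller.
        raise LookupFailed(f"cannot resolve {hostname!r}") from exc
    cache[hostname] = ip
    return ip
```

Raising here means a failed lookup would abort the catalog compile in the Puppet case, instead of producing a template with a section quietly missing.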
[21:09:32] so 2 errors for every request
[21:09:32] _joe_: +1 for https://gerrit.wikimedia.org/r/#/c/212444/ maybe?
[21:09:41] <_joe_> it's probably a one-line fix
[21:09:44] <_joe_> but still.
[21:10:04] <_joe_> yuvipanda: yeah, it's 11 PM, :P
[21:10:09] _joe_: :P ok
[21:10:20] I should go too. Last evening in New York
[21:10:21] brb
[21:11:10] twentyafterfour: I think it might be a problem with the extension.json for SyntaxHighlight_GeSHi
[21:11:38] anything that uses syntax highlighting might be broken :/
[21:11:41] legoktm: ^
[21:11:56] I'm guessing this is the problem -- https://github.com/wikimedia/mediawiki-extensions-SyntaxHighlight_GeSHi/blob/master/extension.json#L51
[21:12:00] hi
[21:12:03] okay here
[21:12:07] aude: I'm trying to figure out what to revert
[21:12:25] i don't know details exactly
[21:12:42] SyntaxHighlight_GeSHi.langs.php should be setting it
[21:12:55] bd808: ok let me see...
[21:13:09] Puppet, operations, Interdatacenter-IPsec: Puppet function: ipresolve: throw an error if lookup fails, refactor into wmflib - https://phabricator.wikimedia.org/T99833#1300061 (Gage) p:Triage>Normal
[21:13:22] perhaps that isn't getting included?
[21:13:33] that's what I'm worndering
[21:13:36] twentyafterfour: https://github.com/wikimedia/mediawiki-extensions-SyntaxHighlight_GeSHi/commit/72d1e9226365959a7fff1ee429787b9ed52e751c
[21:13:39] so
[21:13:44] the 'callback' should do it
[21:13:56] except I think that might even be too late.
[21:14:32] (CR) Ottomata: [WIP] Add parallel kafka pipeline (1 comment) [puppet] - https://gerrit.wikimedia.org/r/210765 (https://phabricator.wikimedia.org/T98779) (owner: Milimetric)
[21:14:46] patch incoming
[21:15:58] (PS1) Legoktm: Set $wgGeSHiSupportedLanguages directly instead of array_intersect [mediawiki-config] - https://gerrit.wikimedia.org/r/212449
[21:16:03] twentyafterfour: ^
[21:16:40] I don't know if I should be happy or sad that beta cluster shows the same problem
[21:16:42] (CR) 20after4: [C: 2] Set $wgGeSHiSupportedLanguages directly instead of array_intersect [mediawiki-config] - https://gerrit.wikimedia.org/r/212449 (owner: Legoktm)
[21:16:43] https://logstash-beta.wmflabs.org/#/dashboard/elasticsearch/fatalmonitor
[21:16:48] (Merged) jenkins-bot: Set $wgGeSHiSupportedLanguages directly instead of array_intersect [mediawiki-config] - https://gerrit.wikimedia.org/r/212449 (owner: Legoktm)
[21:17:18] we need some sort of beta cluster log alerting? ;)
[21:17:25] apparently
[21:17:26] like threshold alerts
[21:17:26] hmmm
[21:17:46] we need the same thing for several problems we see in prod too
[21:17:54] well, this is going to stop the warnings, but I think it's going to end up re-enabling all languages
[21:18:02] since we're going to override the array in the callback
[21:18:23] twentyafterfour: um, are you deploying it?
[21:18:30] or should I?
[21:18:30] yes
[21:18:34] ok
[21:18:37] !log twentyafterfour Synchronized wmf-config/CommonSettings.php: (no message) (duration: 00m 14s)
[21:18:43] Logged the message, Master
[21:19:13] looks better
[21:19:43] at least that error message is gone
[21:19:49] > var_dump(count($wgGeSHiSupportedLanguages));
[21:19:49] int(142)
[21:19:57] legoktm: Does wfLoadExtension() make the settings available immediately?
[21:20:03] bd808: no
[21:20:19] hence why we do this weird backwards merging thing
[21:20:26] but this one is weird
[21:20:26] *nod* thus the undefined var problem
[21:21:19] everything in wmf-config is weird ;)
[21:21:43] But I get that this one is weird even for the land of weird
[21:21:52] bizaroworld
[21:22:01] twentyafterfour: I'm reverting the commit to syntaxhighlight_geshi, my temp fix enabled all languages which is going to be a perf regression
[21:22:10] ok
[21:25:39] !log legoktm Synchronized php-1.26wmf7/extensions/SyntaxHighlight_GeSHi: https://gerrit.wikimedia.org/r/#/c/212450/ (duration: 00m 13s)
[21:25:45] Logged the message, Master
[21:33:02] operations, Traffic: Sanitize varnish director-level retries - https://phabricator.wikimedia.org/T99839#1300131 (BBlack) NEW a:BBlack
[21:41:02] operations, Traffic: Sanitize varnish director-level retries - https://phabricator.wikimedia.org/T99839#1300160 (BBlack) I should note there is some risk upping the retries to `$backend_weight_avg * $num_backends` in the currently non-conforming chash cases (text/upload chashes): in a scenario where a lar...
[21:46:13] Blocked-on-Operations, operations, Maps, Scrum-of-Scrums, hardware-requests: Eqiad Spare allocation: 1 hardware access request for OSM Maps project - https://phabricator.wikimedia.org/T97638#1300177 (RobH) a:akosiaris @Yurik: I think we have the question's I sent to you covered now, in partic...
[21:46:28] operations, Wikimedia-Mailing-lists: Rename Wikidata-l to Wikidata - https://phabricator.wikimedia.org/T99136#1300179 (Ricordisamoa) I am not fond of the -l suffix either, but breaking mail clients... meh.
[21:46:38] bblack: ^ what's a chach? typo?
[21:46:55] chash
[21:46:58] director
[21:47:10] er yeah. i can't type.
[21:47:13] consistent hashing director
[21:47:25] consistently maps URLs to backends
[21:47:43] thanks
[21:50:20] operations, ops-eqiad: analytics1036 can't talk cross row?
- https://phabricator.wikimedia.org/T99845#1300210 (Ottomata) NEW
[21:50:43] operations, ops-eqiad: analytics1036 can't talk cross row? - https://phabricator.wikimedia.org/T99845#1300219 (Ottomata)
[21:51:39] !log ori Synchronized php-1.26wmf6/includes: I32a3cfabc: Made pushLazyJobs() handle all queue groups (duration: 00m 18s)
[21:51:47] Logged the message, Master
[21:52:18] ori, you will need to revert the https://gerrit.wikimedia.org/r/#/c/211926/ cherry picks to really test that
[21:54:22] ottomata: hmm weird, i'll take a look
[21:57:02] RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures
[22:10:44] gah salt hostname globbing. http://docs.saltstack.com/en/latest/topics/targeting/globbing.html says "salt 'web[1-5]' test.ping", so why does "salt 'analytics10[35-38].eqiad.wmnet' cmd.run 'uptime'" say No minions matched the target?
[22:11:19] (PS1) Yuvipanda: wmflib: Move ipresolve function into wmflib [puppet] - https://gerrit.wikimedia.org/r/212464 (https://phabricator.wikimedia.org/T99833)
[22:13:45] (PS1) BBlack: Revert "Add storage cfg for cp1099 T96873" [puppet] - https://gerrit.wikimedia.org/r/212465
[22:14:46] !log twentyafterfour Purged l10n cache for 1.26wmf5
[22:14:52] Logged the message, Master
[22:17:33] (CR) BBlack: [C: 2] Revert "Add storage cfg for cp1099 T96873" [puppet] - https://gerrit.wikimedia.org/r/212465 (owner: BBlack)
[22:30:11] operations, Security-Team: Production cluster can't access labs cluster - https://phabricator.wikimedia.org/T95714#1300301 (Dereckson) @yann @Steinsplitter :
[22:33:59] operations, Security-Team: Production cluster can't access labs cluster - https://phabricator.wikimedia.org/T95714#1300305 (Dereckson) @yann @Steinsplitter would you have any constructive comment instead about the task in addition to repeat it blocks things? We already got security comments, but you didn't...
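Annotation on the salt globbing question at 22:10:44, which goes unanswered in the log: in shell-style globs a bracket expression such as `[1-5]` matches exactly one character, so `[35-38]` is a character class and not the numeric range 35 to 38. The sketch below uses Python's `fnmatch` to illustrate this, on the assumption that Salt's glob targeting follows the same shell-style rules. The minion list and the helper `match_glob` are made up for the example.

```python
import fnmatch

# Hypothetical minion IDs, modelled on the hostnames in the question.
MINIONS = [f"analytics10{n}.eqiad.wmnet" for n in range(33, 41)]


def match_glob(pattern, minion_ids):
    """Return the minion IDs matched by a shell-style glob pattern."""
    return [m for m in minion_ids if fnmatch.fnmatchcase(m, pattern)]


# One bracket expression per character position: a literal "3",
# then a single character in the range 5..8.
selected = match_glob("analytics103[5-8].eqiad.wmnet", MINIONS)
print(selected)
```

Under that assumption, the command from the question would be written as `salt 'analytics103[5-8].eqiad.wmnet' cmd.run 'uptime'`.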
[22:49:13] PROBLEM - puppet last run on cp3037 is CRITICAL puppet fail
[23:00:04] RoanKattouw, ^d, kaldari: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150520T2300). Please do the needful.
[23:01:03] o/
[23:01:17] I'll do it
[23:01:21] kaldari: ping?
[23:01:31] howdy
[23:02:04] legoktm: shit, you're going to ask me if I have PM approval again aren't you? :)
[23:02:52] where is JonKatz when I need him
[23:02:54] kaldari: want me to pretend to be your PM?
[23:03:02] "I approve this"
[23:03:08] yay :)
[23:03:14] I used to be a mobile PM
[23:03:18] I approved things and stuff
[23:03:33] Deskana: can I turn off WikiGrok on en.wiki?
[23:03:37] What's the thing that needs approval?
[23:03:48] kaldari: Does Jon know?
[23:03:51] Jon Katz
[23:03:54] (PS1) Ori.livneh: Add a script to disable Puppet temporarily [puppet] - https://gerrit.wikimedia.org/r/212475
[23:04:05] kaldari: lol no
[23:04:05] bblack, _joe_ ^
[23:04:13] Deskana: No, I just like to turn things on and off
[23:04:16] JK
[23:04:20] yes, he knows
[23:04:22] Yes, JK. Jon Katz.
[23:04:23] (CR) Legoktm: [C: 2] Disable WikiGrok in WMF production [mediawiki-config] - https://gerrit.wikimedia.org/r/212017 (https://phabricator.wikimedia.org/T98142) (owner: Florianschmidtwelzow)
[23:04:30] (Merged) jenkins-bot: Disable WikiGrok in WMF production [mediawiki-config] - https://gerrit.wikimedia.org/r/212017 (https://phabricator.wikimedia.org/T98142) (owner: Florianschmidtwelzow)
[23:04:32] * Deskana is messing with you too
[23:04:44] Sounds fine to me then!
[23:04:47] h.oo's on first?
[23:04:53] Now let's get the other PMs in here to approve
[23:04:57] I'll phone Maryana and Howie
[23:05:02] Approval party
[23:05:04] bd808: I see what you did there.
[23:05:36] !log legoktm Synchronized wmf-config/: Disable WikiGrok in WMF production (duration: 00m 13s)
[23:05:44] kaldari: ^
[23:05:44] Logged the message, Master
[23:05:47] cehcking...
[23:05:49] ori: ≧◡≦ i r so cute today
[23:06:03] RECOVERY - puppet last run on cp3037 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures
[23:07:12] legoktm: looks good. Thanks!@
[23:07:26] !log legoktm Synchronized php-1.26wmf7/extensions/SyntaxHighlight_GeSHi/: https://gerrit.wikimedia.org/r/212456 (duration: 00m 14s)
[23:07:32] Logged the message, Master
[23:08:37] ok, all done :D
[23:09:23] PROBLEM - Apache HTTP on mw1208 is CRITICAL - Socket timeout after 10 seconds
[23:09:32] PROBLEM - HHVM rendering on mw1208 is CRITICAL - Socket timeout after 10 seconds
[23:09:44] i'll restart mw1208
[23:09:46] it's the usual lock-up
[23:10:53] RECOVERY - Apache HTTP on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.065 second response time
[23:11:03] RECOVERY - HHVM rendering on mw1208 is OK: HTTP OK: HTTP/1.1 200 OK - 65469 bytes in 0.615 second response time
[23:12:45] I'm getting the weirdest stuff happening on mediawiki.org...
[23:12:53] PROBLEM - HHVM queue size on mw1208 is CRITICAL 57.14% of data above the critical threshold [80.0]
[23:13:13] PROBLEM - HHVM busy threads on mw1208 is CRITICAL 57.14% of data above the critical threshold [115.2]
[23:13:15] Someone look at Help:Navigation and see if its RTL for them
[23:13:46] logged in or out?
[23:13:53] in
[23:14:10] what language?
[23:14:27] oh wait
[23:14:40] hrm must be caching nm
[23:15:13] uhh
[23:15:22] redirection
[23:15:28] yes it was ar
[23:16:30] ok
[23:17:53] RECOVERY - HHVM queue size on mw1208 is OK Less than 30.00% above the threshold [10.0]
[23:18:13] RECOVERY - HHVM busy threads on mw1208 is OK Less than 30.00% above the threshold [76.8]
[23:19:23] operations, ops-eqiad: analytics1036 can't talk cross row?
- https://phabricator.wikimedia.org/T99845#1300440 (Gage) This is mysterious. I've compared the problem host, analytics1036 (ge-2/0/5), with healthy analytics1035 (ge-2/0/4), which sits right next to it in rack D-2. * Confirmed the problem: ** Ne...
[23:21:42] ^ any network heads wanna take a look at this? not critical, but quite puzzling.
[23:21:56] (PS2) Ori.livneh: Add a script to disable Puppet temporarily [puppet] - https://gerrit.wikimedia.org/r/212475
[23:26:10] (CR) Ori.livneh: [C: 2 V: 2] "Moved this to my dotfiles" [puppet] - https://gerrit.wikimedia.org/r/212475 (owner: Ori.livneh)
[23:29:12] What's a PM?
[23:30:24] Negative24: primate message?
[23:30:29] oh
[23:30:32] product manager, in this context
[23:31:01] or "pompous meddler", if you're cheeky
[23:31:16] ah thanks
[23:31:20] hehe primate message
[23:31:26] <*ori*> OOK OOK
[23:31:48] * Negative24 just say that :D
[23:31:50] oops
[23:31:52] *saw
[23:42:33] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[23:42:54] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[23:43:33] (PS1) Yurik: Beta: updated graphoid to the new api endpoint [mediawiki-config] - https://gerrit.wikimedia.org/r/212480
[23:43:37] gwicke, ^
[23:44:13] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[23:44:33] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[23:46:23] yurik: you might want to hold off on that until the labs entry point actually works
[23:46:40] gwicke, i won't merge until you say its ready
[23:46:43] just CCed you in a mail to Marko, don't want to override that without asking
[23:46:50] and won't merge to prod until tested :)
[23:46:59] i could test on some prod wiki though
[23:47:02] like testwiki
[23:47:04] but meh
[23:47:08] kk