[00:01:28] There's probably numerous log lines that can just be removed altogether
[00:02:44] It's probably very handy when you are working on the code
[00:02:52] just not so much for operational logging
[00:03:43] bd808: this specific issue is a manifestation of a bigger problem, which is that the job queue stack is very unintuitive to maintainers
[00:03:45] This one job created 3200 log events -- https://logstash.wikimedia.org/#dashboard/temp/AVFBGhvUptxhN1XaqiFO
[00:03:52] so I would encourage you to be bold and delete things that don't seem useful to you
[00:05:31] (03PS1) 10Ori.livneh: add the redis instance on rdb1007:6381 to queue servers config [puppet] - 10https://gerrit.wikimedia.org/r/255475
[00:05:34] 6operations, 6Release-Engineering-Team: Rewrite http://download.wikimedia.org/mediawiki/ -> https://releases.wikimedia.org/mediawiki in less than 3 redirects - https://phabricator.wikimedia.org/T119679#1832891 (10Reedy) I should note that I've fixed the script that makes some of the release announcement emails...
[00:06:00] (03CR) 10Ori.livneh: [C: 032 V: 032] add the redis instance on rdb1007:6381 to queue servers config [puppet] - 10https://gerrit.wikimedia.org/r/255475 (owner: 10Ori.livneh)
[00:06:06] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:06:46] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: puppet fail
[00:08:29] Filed as https://phabricator.wikimedia.org/T119682 if anyone wants to go splunking in the OCG source to see if it uses any severity other than info.
[00:11:58] (03PS1) 10Ori.livneh: pool rdb1007:6381 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255480
[00:12:15] (03PS2) 10Ori.livneh: pool rdb1007:6381 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255480
[00:12:29] (03CR) 10Ori.livneh: [C: 032] pool rdb1007:6381 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255480 (owner: 10Ori.livneh)
[00:12:51] (03Merged) 10jenkins-bot: pool rdb1007:6381 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255480 (owner: 10Ori.livneh)
[00:14:14] !log ori@tin Synchronized wmf-config/jobqueue-eqiad.php: I1e0d0ed8215f: pool rdb1007:6381 (duration: 00m 30s)
[00:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:14:58] AaronSchulz: ^ fyi
[00:15:19] 6operations, 6Phabricator, 10netops: Fix edit permissions of the netops project - https://phabricator.wikimedia.org/T119634#1832913 (10Krenair) >>! In T119634#1831776, @faidon wrote: > it's probably a remnant from the default policy from projects coming off from the RT migration. I just reset it to the defau...
[00:22:16] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: puppet fail
[00:32:57] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:41:37] (03PS1) 10Ori.livneh: Migrate rdb1004 to jobqueue_redis role [puppet] - 10https://gerrit.wikimedia.org/r/255483
[00:45:32] (03CR) 10Ori.livneh: [C: 032] Migrate rdb1004 to jobqueue_redis role [puppet] - 10https://gerrit.wikimedia.org/r/255483 (owner: 10Ori.livneh)
[00:50:26] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:52:36] PROBLEM - puppet last run on rdb1004 is CRITICAL: CRITICAL: puppet fail
[00:54:36] RECOVERY - puppet last run on rdb1004 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[01:15:26] (03PS1) 10Ori.livneh: Migrate rdb1003 to jobqueue_redis role [puppet] - 10https://gerrit.wikimedia.org/r/255490
[01:16:36] PROBLEM - puppet last run on mw1146 is CRITICAL: CRITICAL: Puppet has 63 failures
[01:19:26] (03CR) 10Ori.livneh: [C: 04-1] "Should be good to go, but must be accompanied by some manual work: replace /etc/redis/redis.conf with a copy of /etc/redis/redis.conf.samp" [puppet] - 10https://gerrit.wikimedia.org/r/255490 (owner: 10Ori.livneh)
[01:43:17] PROBLEM - HHVM rendering on mw1146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:44:25] PROBLEM - Apache HTTP on mw1146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:44:27] PROBLEM - configured eth on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:44:37] PROBLEM - dhclient process on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:44:46] PROBLEM - DPKG on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:44:47] PROBLEM - HHVM processes on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:44:56] PROBLEM - Disk space on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:45:16] PROBLEM - nutcracker process on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:45:55] PROBLEM - RAID on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:46:05] PROBLEM - salt-minion processes on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:46:06] PROBLEM - SSH on mw1146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:46:16] PROBLEM - nutcracker port on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:46:25] PROBLEM - Check size of conntrack table on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:47:46] RECOVERY - RAID on mw1146 is OK: OK: no RAID installed
[01:47:56] RECOVERY - salt-minion processes on mw1146 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[01:53:46] PROBLEM - RAID on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:53:56] PROBLEM - salt-minion processes on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:55:45] RECOVERY - salt-minion processes on mw1146 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[01:55:55] RECOVERY - SSH on mw1146 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[01:55:56] RECOVERY - nutcracker port on mw1146 is OK: TCP OK - 0.000 second response time on port 11212
[01:56:05] RECOVERY - Check size of conntrack table on mw1146 is OK: OK: nf_conntrack is 0 % full
[01:56:15] RECOVERY - configured eth on mw1146 is OK: OK - interfaces up
[01:56:25] RECOVERY - dhclient process on mw1146 is OK: PROCS OK: 0 processes with command name dhclient
[01:56:26] RECOVERY - DPKG on mw1146 is OK: All packages OK
[01:56:27] RECOVERY - HHVM processes on mw1146 is OK: PROCS OK: 6 processes with command name hhvm
[01:56:37] RECOVERY - Disk space on mw1146 is OK: DISK OK
[01:56:57] RECOVERY - nutcracker process on mw1146 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[01:57:35] RECOVERY - RAID on mw1146 is OK: OK: no RAID installed
[02:00:16] !log l10nupdate@tin LocalisationUpdate failed: git pull of core failed
[02:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:11:56] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1833191 (10GWicke) @fgiunchedi: I created T119659 to get SSDs for 100[7-9], which should let us move forward without running out of space.
[02:33:56] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 16.67% of data above the critical threshold [5000000.0]
[02:43:46] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 83.33% of data above the critical threshold [5000000.0]
[02:55:35] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[03:00:29] 6operations, 6Release-Engineering-Team: Rewrite http://download.wikimedia.org/mediawiki/ -> https://releases.wikimedia.org/mediawiki in less than 3 redirects - https://phabricator.wikimedia.org/T119679#1833203 (10Krinkle) I guess this related to the trailing slash redirect being handled by Apache, which respon...
[03:22:13] (03PS1) 10Andrew Bogott: Remove labs_ldap_dns_ip_override hiera setting [puppet] - 10https://gerrit.wikimedia.org/r/255495
[03:40:15] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 48356 bytes in 7.255 second response time
[03:41:06] PROBLEM - puppet last run on mw2007 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:46:06] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:06:25] RECOVERY - puppet last run on mw2007 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[04:42:26] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 48356 bytes in 9.739 second response time
[04:44:57] PROBLEM - HHVM rendering on mw1115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:45:07] PROBLEM - Apache HTTP on mw1115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:47:16] PROBLEM - configured eth on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:47:17] PROBLEM - Check size of conntrack table on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:47:26] PROBLEM - dhclient process on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:47:36] PROBLEM - SSH on mw1115 is CRITICAL: Server answer
[04:47:36] PROBLEM - DPKG on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:47:37] PROBLEM - nutcracker port on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:47:49] PROBLEM - Disk space on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:47:56] PROBLEM - nutcracker process on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:48:15] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time
[04:48:15] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:48:16] PROBLEM - salt-minion processes on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:48:16] PROBLEM - HHVM processes on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:48:26] PROBLEM - puppet last run on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:48:37] PROBLEM - RAID on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:49:46] RECOVERY - Disk space on mw1115 is OK: DISK OK
[04:52:05] RECOVERY - salt-minion processes on mw1115 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[04:52:06] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1686 bytes in 0.014 second response time
[04:52:16] RECOVERY - puppet last run on mw1115 is OK: OK: Puppet is currently enabled, last run 39 minutes ago with 0 failures
[04:52:26] RECOVERY - RAID on mw1115 is OK: OK: no RAID installed
[04:52:46] RECOVERY - Apache HTTP on mw1115 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.795 second response time
[04:52:56] RECOVERY - configured eth on mw1115 is OK: OK - interfaces up
[04:53:05] RECOVERY - Check size of conntrack table on mw1115 is OK: OK: nf_conntrack is 0 % full
[04:53:15] RECOVERY - dhclient process on mw1115 is OK: PROCS OK: 0 processes with command name dhclient
[04:53:16] RECOVERY - DPKG on mw1115 is OK: All packages OK
[04:53:25] RECOVERY - nutcracker port on mw1115 is OK: TCP OK - 0.000 second response time on port 11212
[04:53:26] RECOVERY - SSH on mw1115 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[04:53:45] RECOVERY - nutcracker process on mw1115 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[04:53:56] RECOVERY - HHVM processes on mw1115 is OK: PROCS OK: 6 processes with command name hhvm
[04:54:36] RECOVERY - HHVM rendering on mw1115 is OK: HTTP OK: HTTP/1.1 200 OK - 64874 bytes in 1.960 second response time
[05:04:03] (03CR) 10Andrew Bogott: "puppet compiler approves" [puppet] - 10https://gerrit.wikimedia.org/r/255495 (owner: 10Andrew Bogott)
[05:56:50] (03CR) 10Santhosh: [C: 04-1] CX: Use ContentTranslationRESTBase (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255102 (https://phabricator.wikimedia.org/T111562) (owner: 10KartikMistry)
[06:15:26] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 48356 bytes in 9.190 second response time
[06:21:16] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:30:44] (03CR) 10Ori.livneh: [C: 032 V: 032] Migrate rdb1003 to jobqueue_redis role [puppet] - 10https://gerrit.wikimedia.org/r/255490 (owner: 10Ori.livneh)
[06:31:06] PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:36] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:46] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:46] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:46] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:47] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:57] PROBLEM - puppet last run on mw1086 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:07] PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:37:46] PROBLEM - Ubuntu mirror in sync with upstream on carbon is CRITICAL: /srv/mirrors/ubuntu is over 12 hours old.
[06:39:36] (03PS1) 10Ori.livneh: Declare the two new redis instances on rdb1003 to the jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/255498
[06:40:03] (03CR) 10Ori.livneh: [C: 032 V: 032] Declare the two new redis instances on rdb1003 to the jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/255498 (owner: 10Ori.livneh)
[06:41:03] (03PS1) 10Ori.livneh: Update job queue config to use 2 new rdb1003 instances [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255499
[06:41:18] (03PS2) 10Ori.livneh: Update job queue config to use 2 new rdb1003 instances [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255499
[06:42:31] (03PS3) 10KartikMistry: CX: Use ContentTranslationRESTBase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255102 (https://phabricator.wikimedia.org/T111562)
[06:43:08] (03CR) 10jenkins-bot: [V: 04-1] CX: Use ContentTranslationRESTBase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255102 (https://phabricator.wikimedia.org/T111562) (owner: 10KartikMistry)
[06:43:27] RECOVERY - Ubuntu mirror in sync with upstream on carbon is OK: /srv/mirrors/ubuntu is over 0 hours old.
[06:43:56] RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[06:44:17] (03CR) 10Ori.livneh: [C: 032] Update job queue config to use 2 new rdb1003 instances [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255499 (owner: 10Ori.livneh)
[06:44:38] (03Merged) 10jenkins-bot: Update job queue config to use 2 new rdb1003 instances [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255499 (owner: 10Ori.livneh)
[06:45:58] !log ori@tin Synchronized wmf-config/jobqueue-eqiad.php: I3e6021ed3: Update job queue config to use 2 new rdb1003 instances (duration: 00m 28s)
[06:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:47:11] (03PS4) 10KartikMistry: CX: Use ContentTranslationRESTBase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255102 (https://phabricator.wikimedia.org/T111562)
[06:47:36] (03PS1) 10Ori.livneh: migrate rdb1001 and rdb1002 to jobqueue_redis role [puppet] - 10https://gerrit.wikimedia.org/r/255500
[06:50:52] (03CR) 10Ori.livneh: [C: 032] migrate rdb1001 and rdb1002 to jobqueue_redis role [puppet] - 10https://gerrit.wikimedia.org/r/255500 (owner: 10Ori.livneh)
[06:54:36] PROBLEM - puppet last run on rdb1002 is CRITICAL: CRITICAL: puppet fail
[06:55:09] <_joe_> ori: ^^ should I take a look?
[06:55:19] nope, i'm migrating it
[06:55:21] should be done in a sec
[06:55:44] <_joe_> cool
[06:56:26] RECOVERY - puppet last run on restbase2006 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:57:12] PROBLEM - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 {channel:frontend.error,request:{id:1448521023363-93262},error:{message:Status check failed (redis failure?)}} - 232 bytes in 0.006 second response time
[06:58:13] ori: ^ was this also maybe using one of the redises accidentally?
[06:58:27] no, i think it's a coincidence
[06:58:33] worth checking
[06:58:36] RECOVERY - puppet last run on rdb1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:45] _joe_: can you take a look or want me to?
[06:58:48] apergos: ^
[06:58:52] I see it
[06:58:55] ok!
[06:59:03] <_joe_> YuviPanda: I'm looking
[06:59:10] thanks
[06:59:15] * YuviPanda kicks back and drinks some tea
[06:59:15] <_joe_> ocg does use the production redises
[06:59:18] RECOVERY - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.016 second response time
[06:59:42] and there's the recovery too
[07:00:04] _joe_: what do you mean, rdb100x?
[07:00:31] <_joe_> rdb1002 specifically
[07:00:48] <_joe_> and yeah, it was related to the change, but was just a flap
[07:01:00] ah, damn it
[07:01:01] i missed that
[07:01:02] hieradata/common/ocg.yaml:2:redis_host: "rdb1002.eqiad.wmnet"
[07:01:07] why does it use some random jobqueue redis?
[07:01:10] that's awful
[07:01:17] _joe_: does it at least use its own instance or the default port?
[07:01:27] it must use the default port
[07:01:30] ...
[07:01:31] because multi-instance is brand new
[07:01:32] * YuviPanda doesn't want to share ores' redis for similar reasons
[07:01:50] <_joe_> AaronSchulz: I'm going to look at the config now
[07:01:57] <_joe_> because yeah, I'm baffled
[07:02:55] how is that a good idea
[07:03:12] <_joe_> ori: I guess that was moved when rdb1001 failed
[07:03:21] <_joe_> and we failed to move it back
[07:03:25] how was using rdb1001 a good idea
[07:03:33] rdb1002 is esp crazy because it's a slave
[07:03:55] <_joe_> it was a good idea given ocg gets its data from mediawiki?
[07:04:06] oh wow. so it only worked since we didn't make slaves readonly?!
[07:04:10] <_joe_> and I think mediawiki just enqueues pdf renderings there?
[07:04:24] YuviPanda: yes
[07:04:34] that bit me in tools once too
[07:04:47] I was going to use the redis puppet move to fix that and make slaves readonly
[07:04:49] (03PS5) 10KartikMistry: CX: Use ContentTranslationRESTBase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255102 (https://phabricator.wikimedia.org/T111562)
[07:05:16] well good you didn't!
[07:05:39] I was going to do it for tools :)
[07:05:41] not for prod
[07:05:43] ah
[07:05:59] <_joe_> anyways, the error here is we don't use service CNAMEs like "redis-master-s1"
[07:06:06] ^
[07:06:10] <_joe_> and we move things around in the config instead
[07:06:24] for ores we fake it with a puppetized /etc/hosts entry
[07:06:29] ghetttoooo DNS
[07:06:34] <_joe_> EWWWWWW
[07:06:38] yeah, is terrible
[07:06:40] no, i think it's worse than that
[07:06:41] <_joe_> YuviPanda: not in production I hope
[07:06:51] _joe_: nah in production we actually have DNS servers we can control :)
[07:06:57] the job queue setup is complicated and fragile and (until very recently) highly overloaded
[07:07:26] _joe_: for labs I wanted to make sure halfak can fail over himself if I wasn't around so anything requiring a patch to ops/puppet was boo. this can be done just via hiera.
[07:07:26] rdb1001 is an aggregator, too
[07:07:30] <_joe_> ori: ocg on the other hand uses very little resources, it's just a true queue in the redis sense
[07:07:32] so jobqueue aggregator, queue redis, and ocw
[07:07:37] *ocg
[07:07:38] * YuviPanda wonders if ocg was contributing to it
[07:07:41] <_joe_> ori: an aggregator of what?
[07:07:52] <_joe_> YuviPanda: no it was not
[07:08:03] <_joe_> anyways, ocg uses rdb1002 :/
[07:08:05] the job queue has redis aggregators and redis queue servers
[07:08:10] <_joe_> we got to change that
[07:11:43] (03PS1) 10Ori.livneh: Declare the two new redis instances on rdb1001 to the jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/255503
[07:11:49] (03PS1) 10Aaron Schulz: Add comment about partition weights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255504
[07:13:26] (03PS1) 10Ori.livneh: Update job queue config to use 2 new rdb1001 instances [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255505
[07:13:37] (03CR) 10Ori.livneh: [C: 032] Declare the two new redis instances on rdb1001 to the jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/255503 (owner: 10Ori.livneh)
[07:15:55] (03CR) 10Aaron Schulz: Update job queue config to use 2 new rdb1001 instances (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255505 (owner: 10Ori.livneh)
[07:15:57] !log rdb100x migrated to redis::instance, meaning they each have two additional redis instances, on ports 6380 and 6381, respectively.
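[Editor's note] The "partition weights" patch discussed above controls how MediaWiki spreads jobs across the rdb100x queue instances. A minimal sketch of weighted partition selection — hypothetical host names and weights, not the actual values in wmf-config/jobqueue-eqiad.php:

```python
import random

# Hypothetical partition map: queue server instance -> relative weight.
# The real values live in wmf-config/jobqueue-eqiad.php.
PARTITIONS = {
    "rdb1001:6380": 30,
    "rdb1001:6381": 30,
    "rdb1003:6380": 20,
    "rdb1003:6381": 20,
}


def pick_partition(partitions, rng=random):
    """Pick a partition with probability proportional to its weight."""
    total = sum(partitions.values())
    point = rng.uniform(0, total)
    for host, weight in partitions.items():
        point -= weight
        if point <= 0:
            return host
    return host  # float-rounding fallback: last partition


# Sample 10000 picks; counts should roughly track the weights (30/30/20/20).
counts = {h: 0 for h in PARTITIONS}
for _ in range(10000):
    counts[pick_partition(PARTITIONS)] += 1
```

With weights like these, roughly 30% of new jobs land on each rdb1001 instance and 20% on each rdb1003 instance; setting a weight to 0 effectively depools a partition for new enqueues.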
[07:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:22:18] (03CR) 10Ori.livneh: [C: 032] Add comment about partition weights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255504 (owner: 10Aaron Schulz)
[07:22:39] (03Merged) 10jenkins-bot: Add comment about partition weights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255504 (owner: 10Aaron Schulz)
[07:23:42] (03PS2) 10Ori.livneh: Update job queue config to use 2 new rdb1001 instances [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255505
[07:23:55] (03CR) 10Ori.livneh: [C: 032] Update job queue config to use 2 new rdb1001 instances [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255505 (owner: 10Ori.livneh)
[07:24:16] (03Merged) 10jenkins-bot: Update job queue config to use 2 new rdb1001 instances [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255505 (owner: 10Ori.livneh)
[07:25:21] RECOVERY - puppet last run on mw1086 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[07:32:20] (03PS1) 10Ori.livneh: Migrate codfw jobqueue redises to jobqueue_redis role [puppet] - 10https://gerrit.wikimedia.org/r/255506
[07:37:00] (03CR) 10Ori.livneh: [C: 032] Migrate codfw jobqueue redises to jobqueue_redis role [puppet] - 10https://gerrit.wikimedia.org/r/255506 (owner: 10Ori.livneh)
[07:39:21] 6operations, 10Traffic: cp4007 crashed - https://phabricator.wikimedia.org/T117746#1833323 (10MoritzMuehlenhoff) Agreed. Also, the cp* hosts will be updated to 4.3 anyway.
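[Editor's note] The puppetized /etc/hosts workaround YuviPanda describes earlier (so an ores redis failover needs only a hiera edit, not a DNS or ops/puppet patch) might look roughly like this — a sketch with a hypothetical class, hiera key, and IP, not the actual ores manifests:

```puppet
# Hypothetical sketch: clients are configured to talk to 'ores-redis-01';
# which box that actually resolves to comes from hiera, so failing over
# is a one-line hiera edit rather than a DNS or ops/puppet change.
class ores::redis_alias {
    $redis_master_ip = hiera('ores::redis_master_ip', '10.68.16.150')

    host { 'ores-redis-01':
        ensure => present,
        ip     => $redis_master_ip,
    }
}
```

This is the "ghetto DNS" the channel jokes about: the same indirection a service CNAME like "redis-master-s1" would give, but scoped to labs where project admins cannot edit production DNS.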
[07:44:17] (03PS1) 10Ori.livneh: Goodbye to role::redisdb [puppet] - 10https://gerrit.wikimedia.org/r/255507
[07:44:59] ori: awww, less than a week after I moved it from role::db::redis
[07:45:00] nice
[07:45:14] ori: but: remember removing it needs cleanup on labs ldap too since deployment-prep uses it too
[07:45:23] ohh
[07:45:25] otherwise it'll cause puppet failure and other bad things
[07:45:25] i always forget
[07:45:29] ofc :P
[07:45:59] yuvipanda: so how do i do that?
[07:46:06] just via the configure interface on wikitech?
[07:46:12] ori: yeah that might be simplest
[07:46:28] I edit LDAP directly for mass changes but in this case I think it's only one or two instances...
[07:46:29] <_joe_> ugh
[07:46:58] this is not a nice setup
[07:47:13] which of the many things that aren't nice are you talking about?
[07:47:29] i'm not even sure myself
[07:47:38] (also this is why it took me a long time to move roles around and rename them)
[07:47:40] vague aura of badness permeating everything
[07:47:42] yeah
[07:47:46] i know, not blaming you!
[07:47:51] yeah :)
[07:47:56] ori: horizon.wikimedia.org! SOMEDAY!
[07:48:02] (it doesn't do puppet groups yet)
[07:48:09] (or any puppet integration at all)
[07:48:15] but is an overall nice interface otherwise
[07:48:31] I'd also prefer node definitions for deployment-prep / tools be in operations/puppet but that's for another day
[07:49:16] <_joe_> yuvipanda: so I am preaching we need a nice ENC in production and you want to move things to ops/puppet instead?
[07:50:09] _joe_: no, not to site.pp
[07:50:14] yuvipanda: how do i add a role to the config interface?
[07:50:14] an ENC that's in (a git repo)
[07:50:24] ori: there's 'manage puppet groups' on the left sidebar
[07:50:27] oh right
[07:50:36] ori: there'll be a tantalizing 'modify' link next to role names.
[07:50:45] as can be expected from OSM, it is only there to frustrate
[07:50:47] don't click it?
[07:50:48] heh
[07:50:49] heheh
[07:50:50] I won't tell you what it does
[07:50:55] but try and be frustrated!
[07:51:02] <_joe_> DON'T CLICK ON THAT LINK!
[07:51:05] it'll have options to do the most useless thing you can think of
[07:51:10] but not the most obvious thing you'd want to do
[07:51:48] (it allows you to change the group that role is in(?!!?) but not rename or edit the role name itself!)
[07:59:07] (03CR) 10Ori.livneh: [C: 032] "I migrated deployment-redis0[1-2] in beta and deleted the class from Puppet ldap on labs" [puppet] - 10https://gerrit.wikimedia.org/r/255507 (owner: 10Ori.livneh)
[07:59:30] ori: thanks :)
[07:59:38] thanks for the reminder
[08:00:50] np!
[08:01:01] ori: should I or should I not tell you that the staging project also has redises? :)
[08:01:17] (that's the partially complete deployment-prep replacement)
[08:01:42] ori: http://tools.wmflabs.org/watroles/role/role::redisdb
[08:01:45] ori: and a bunch of other instances too
[08:01:54] well, just totally 3 instances left
[08:02:25] ugh
[08:06:36] i'll update those labs instances in a moment
[08:11:20] PROBLEM - puppet last run on mw2063 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:25:09] 6operations, 7Database, 5Patch-For-Review: Drop `user_daily_contribs` table from all production wikis - https://phabricator.wikimedia.org/T115711#1833333 (10jcrespo) 5Open>3Resolved Thanks, @MaxSem. I think this ticket will serve as a checklist of all things needed for a full table deprecation: productio...
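[Editor's note] The "ENC that's in (a git repo)" idea floated above refers to Puppet's external node classifier protocol: the master runs an executable with the node's certname as its argument and reads a YAML node definition from stdout. A minimal sketch — the node-to-roles mapping here is hypothetical and hardcoded, where the proposal would version it in a git repo:

```python
#!/usr/bin/env python
"""Minimal puppet External Node Classifier (ENC) sketch.

Puppet calls the configured ENC with the node certname as argv[1]
and expects a YAML document (at minimum a 'classes' key) on stdout.
"""
import sys

# Hypothetical mapping; in the proposal above this would be a file
# checked out from a git repo, not a hardcoded dict.
NODES = {
    "rdb1001.eqiad.wmnet": ["role::jobqueue_redis"],
    "rdb1003.eqiad.wmnet": ["role::jobqueue_redis"],
}


def classify(certname, nodes=NODES):
    """Render the ENC YAML for one node; unknown nodes get no classes."""
    roles = nodes.get(certname, [])
    if not roles:
        return "classes: []\n"
    return "classes:\n" + "".join("  - %s\n" % role for role in roles)


if __name__ == "__main__":
    sys.stdout.write(classify(sys.argv[1]))
```

Because the classifier is just an executable, swapping the dict for a `git pull`-ed YAML file gives exactly what _joe_ asks for: node definitions that are reviewable and versioned without living in site.pp.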
[08:32:48] redisdb now fully dead
[08:32:53] the role, i mean
[08:34:50] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0]
[08:36:40] RECOVERY - puppet last run on mw2063 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:43:50] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 48363 bytes in 8.299 second response time
[08:46:39] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[08:49:50] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:56:04] 6operations, 10Wikimedia-General-or-Unknown, 7user-notice: Content of Special:BrokenRedirects and many others query pages not updated since 2015-09-16 - https://phabricator.wikimedia.org/T113721#1833351 (10jcrespo) I see some concerning issues (connection problems) with db1053 that may or may not be related,...
[08:59:41] (03PS1) 10Jcrespo: Depooling db1053 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255509
[09:00:33] (03CR) 10Jcrespo: [C: 032] Depooling db1053 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255509 (owner: 10Jcrespo)
[09:02:29] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1053 for maintenance (duration: 00m 27s)
[09:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:04:06] (03PS2) 10Muehlenhoff: Assign openldap::labs role to seaborgium/serpens [puppet] - 10https://gerrit.wikimedia.org/r/255373
[09:10:14] sadly, I have to kill the very same query I am trying to fix
[09:18:07] 6operations, 6Release-Engineering-Team: Rewrite http://download.wikimedia.org/mediawiki/ -> https://releases.wikimedia.org/mediawiki in less than 3 redirects - https://phabricator.wikimedia.org/T119679#1833355 (10hashar)
[09:20:35] (03PS1) 10Jcrespo: Set ferm and performance_schema for db1053 [puppet] - 10https://gerrit.wikimedia.org/r/255510
[09:22:09] (03CR) 10Yuvipanda: [C: 04-2] "Um, only somewhat unrelated but can this be labs::openldap and live in the role module please?" [puppet] - 10https://gerrit.wikimedia.org/r/255373 (owner: 10Muehlenhoff)
[09:22:17] moritzm: ^
[09:22:29] I've been moving a bunch of stuff that way and don't want to have to move this right after it gets merged
[09:22:34] not sure if that should be a -2 tho
[09:22:36] * yuvipanda degrades to -1
[09:22:43] (03CR) 10Yuvipanda: [C: 04-1] Assign openldap::labs role to seaborgium/serpens [puppet] - 10https://gerrit.wikimedia.org/r/255373 (owner: 10Muehlenhoff)
[09:25:30] (03PS2) 10Jcrespo: Set ferm and performance_schema for db1053 [puppet] - 10https://gerrit.wikimedia.org/r/255510
[09:25:57] yuvipanda: I made it intentionally similar to openldap::corp, I'd rather proceed with the status quo and move over both at some point
[09:26:05] (03CR) 10Alexandros Kosiaris: "Oh, so this is not really a "labs" role. It's the LDAP servers that support labs. As in replacement for OpenDJ. Maybe "labs" is a bad nami" [puppet] - 10https://gerrit.wikimedia.org/r/255373 (owner: 10Muehlenhoff)
[09:26:09] hehe
[09:26:20] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [5000000.0]
[09:26:34] akosiaris: moritzm so I've been moving the roles that support labs too to modules/role/manifests/labs
[09:26:37] (03CR) 10Jcrespo: [C: 032] Set ferm and performance_schema for db1053 [puppet] - 10https://gerrit.wikimedia.org/r/255510 (owner: 10Jcrespo)
[09:27:15] 6operations, 6Release-Engineering-Team: Rewrite http://download.wikimedia.org/mediawiki/ -> https://releases.wikimedia.org/mediawiki in less than 3 redirects - https://phabricator.wikimedia.org/T119679#1833363 (10hashar) Seems the wiki got updated as https://www.mediawiki.org/wiki/Download points to https://re...
[09:27:18] I'm not really sure about corp but this is distinctly similar enough to the other stuff in labs/ that I feel reasonably strongly about it
[09:31:37] (03CR) 10Muehlenhoff: "Faidon suggested limbo :-)" [puppet] - 10https://gerrit.wikimedia.org/r/255373 (owner: 10Muehlenhoff)
[09:33:35] (03CR) 10Yuvipanda: "so many temporary hacks left littered around, let's hope this one fares better :)" [puppet] - 10https://gerrit.wikimedia.org/r/255373 (owner: 10Muehlenhoff)
[09:38:57] !log upgrading/rebooting/restarting mysql on db1053 (depooled)
[09:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:40:56] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[09:40:58] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1833382 (10fgiunchedi) @gwicke 1007/1008/1009 have been provisioned on purpose as proportionally smaller machines, we could add ssd there but we m...
[09:48:26] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 48356 bytes in 7.133 second response time
[09:49:58] !log restarted cassandra nodes on the restbase staging cluster to pick up openjdk security updates
[09:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:53:38] PROBLEM - Restbase endpoints health on restbase-test2002 is CRITICAL: /page/mobile-sections/{title} is CRITICAL: Test Get MobileApps Foobar page returned the unexpected status 500 (expecting: 200)
[09:54:27] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:55:28] moritzm: ^ taking a look
[09:59:56] RECOVERY - Restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy
[10:04:13] (03PS3) 10Muehlenhoff: Assign openldap::labs role to seaborgium/serpens [puppet] - 10https://gerrit.wikimedia.org/r/255373
[10:04:32] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign openldap::labs role to seaborgium/serpens [puppet] - 10https://gerrit.wikimedia.org/r/255373 (owner: 10Muehlenhoff)
[10:09:56] (03CR) 10Alexandros Kosiaris: [C: 032] Bring etherpad-lite configuration up to date with upstream [puppet] - 10https://gerrit.wikimedia.org/r/255402 (owner: 10Alexandros Kosiaris)
[10:10:01] (03PS2) 10Alexandros Kosiaris: Bring etherpad-lite configuration up to date with upstream [puppet] - 10https://gerrit.wikimedia.org/r/255402
[10:10:21] (03CR) 10Alexandros Kosiaris: [V: 032] Bring etherpad-lite configuration up to date with upstream [puppet] - 10https://gerrit.wikimedia.org/r/255402 (owner: 10Alexandros Kosiaris)
[10:10:47] moritzm: I merged yours as well
[10:10:56] lol/win 35
[10:10:58] er :)
[10:11:07] wc
[10:11:13] thanks, was just about to do it
[10:12:03] (03CR) 10Alexandros Kosiaris: [C: 032] Set a standard defaultPadText for etherpad [puppet] - 10https://gerrit.wikimedia.org/r/255403 (owner: 10Alexandros Kosiaris)
[10:12:08] (03PS2) 10Alexandros Kosiaris: Set a standard defaultPadText for etherpad [puppet] - 10https://gerrit.wikimedia.org/r/255403
[10:16:58] PROBLEM - puppet last run on serpens is CRITICAL: CRITICAL: puppet fail
[10:17:59] (03PS1) 10Muehlenhoff: Fix typo in module argument [puppet] - 10https://gerrit.wikimedia.org/r/255514
[10:18:24] (03CR) 10Muehlenhoff: [C: 032 V: 032] Fix typo in module argument [puppet] - 10https://gerrit.wikimedia.org/r/255514 (owner: 10Muehlenhoff)
[10:20:40] (03PS3) 10Alexandros Kosiaris: Set a standard defaultPadText for etherpad [puppet] - 10https://gerrit.wikimedia.org/r/255403
[10:20:44] (03CR) 10Alexandros Kosiaris: [V: 032] Set a standard defaultPadText for etherpad [puppet] - 10https://gerrit.wikimedia.org/r/255403 (owner: 10Alexandros Kosiaris)
[10:26:14] (03CR) 10Alexandros Kosiaris: [C: 04-2] "X-Forwarded-For can be set by a client application and hence it is not trustable yet. Need to address this in the misc-web varnish layers f" [puppet] - 10https://gerrit.wikimedia.org/r/255404 (owner: 10Alexandros Kosiaris)
[10:31:09] (03PS1) 10Muehlenhoff: Add certificates for opendj replacement servers [puppet] - 10https://gerrit.wikimedia.org/r/255515
[10:36:52] 6operations, 7Database, 5Patch-For-Review: implement performance_schema for mysql monitoring - https://phabricator.wikimedia.org/T99485#1833435 (10jcrespo)
[10:37:33] PROBLEM - Labs LDAP on serpens is CRITICAL: Could not bind to the LDAP server
[10:43:52] (03PS1) 10Yuvipanda: k8s: Add ServiceAccount and ResourceQuota admission controllers [puppet] - 10https://gerrit.wikimedia.org/r/255516
[10:48:42] ACKNOWLEDGEMENT - Labs LDAP on serpens is CRITICAL: Could not bind to the LDAP server Muehlenhoff In setup
[10:48:42] ACKNOWLEDGEMENT - puppet last run on serpens is CRITICAL: CRITICAL: Puppet has 1 failures Muehlenhoff In setup
[10:49:46] (03PS2) 10Yuvipanda: k8s: Add ServiceAccount and ResourceQuota admission controllers [puppet] - 10https://gerrit.wikimedia.org/r/255516
[10:49:55] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Add
ServiceAccount and ResourceQuota admission controllers [puppet] - 10https://gerrit.wikimedia.org/r/255516 (owner: 10Yuvipanda) [10:49:59] (03CR) 10Alexandros Kosiaris: [C: 031] Add certificates for opendj replacement servers [puppet] - 10https://gerrit.wikimedia.org/r/255515 (owner: 10Muehlenhoff) [10:52:13] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 48356 bytes in 6.658 second response time [10:53:53] (03PS2) 10Muehlenhoff: Add certificates for opendj replacement servers [puppet] - 10https://gerrit.wikimedia.org/r/255515 [10:54:02] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add certificates for opendj replacement servers [puppet] - 10https://gerrit.wikimedia.org/r/255515 (owner: 10Muehlenhoff) [10:58:31] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:59:13] (03PS1) 10Yuvipanda: Revert "k8s: Add ServiceAccount and ResourceQuota admission controllers" [puppet] - 10https://gerrit.wikimedia.org/r/255517 [10:59:20] (03PS2) 10Yuvipanda: Revert "k8s: Add ServiceAccount and ResourceQuota admission controllers" [puppet] - 10https://gerrit.wikimedia.org/r/255517 [10:59:29] (03CR) 10Yuvipanda: [C: 032 V: 032] Revert "k8s: Add ServiceAccount and ResourceQuota admission controllers" [puppet] - 10https://gerrit.wikimedia.org/r/255517 (owner: 10Yuvipanda) [11:00:32] (03PS1) 10Mdann52: Add portal namespace to ps.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255519 (https://phabricator.wikimedia.org/T119510) [11:00:58] (03CR) 10jenkins-bot: [V: 04-1] Add portal namespace to ps.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255519 (https://phabricator.wikimedia.org/T119510) (owner: 10Mdann52) [11:01:32] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Puppet has 1 failures [11:02:04] (03PS2) 10Mdann52: Add portal namespace to ps.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255519 (https://phabricator.wikimedia.org/T119510) 
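The ps.wikipedia.org change above (Gerrit 255519) adds a portal namespace through mediawiki-config; the general MediaWiki mechanism for this is `$wgExtraNamespaces`. A minimal sketch, with hypothetical namespace IDs and English names for illustration only (the actual IDs and Pashto names in the Gerrit change may differ):

```php
// Hypothetical sketch of adding a "Portal" namespace in LocalSettings.php
// or wmf-config. IDs 100/101 follow the conventional even/odd
// subject+talk pairing, but are NOT taken from the actual change.
define( 'NS_PORTAL', 100 );
define( 'NS_PORTAL_TALK', 101 );

$wgExtraNamespaces[NS_PORTAL] = 'Portal';
$wgExtraNamespaces[NS_PORTAL_TALK] = 'Portal_talk';
```

The jenkins-bot V:-1 votes above would typically come from lint/structure tests on the config arrays rather than from the namespace definition itself.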
[11:02:12] (03PS1) 10Aaron Schulz: Adjust connection timeout and maxPartitionsTry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255520 [11:02:21] (03CR) 10jenkins-bot: [V: 04-1] Add portal namespace to ps.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255519 (https://phabricator.wikimedia.org/T119510) (owner: 10Mdann52) [11:04:25] (03PS3) 10Mdann52: Add portal namespace to ps.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255519 (https://phabricator.wikimedia.org/T119510) [11:06:09] '+channel:exception +message:*tried*' over 7d is kind of disturbing [11:13:04] (03PS1) 10Jcrespo: Repool db1053 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255524 [11:13:27] !log disabled puppet on neodymium for salt master testing [11:13:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:14:29] 6operations, 6Labs: Untangle labs/production roles from labs/instance roles - https://phabricator.wikimedia.org/T119401#1833513 (10yuvipanda) I've done this for most things, just a couple left (openldap::labs is a new and sad exception :() [11:15:15] (03CR) 10Jcrespo: [C: 032] Repool db1053 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255524 (owner: 10Jcrespo) [11:16:34] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1053 after maintenance (duration: 00m 28s) [11:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:19:21] RECOVERY - puppet last run on serpens is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [11:21:40] 6operations, 10hardware-requests: spare swift disks order - https://phabricator.wikimedia.org/T119698#1833521 (10fgiunchedi) 3NEW [11:23:46] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/255517 (owner: 10Yuvipanda) [11:25:39] !log manually executing some updateSpecial pages script to try to debug its problems 
[11:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:32:42] (03PS1) 10ArielGlenn: increase zmq high water mark for salt master [puppet] - 10https://gerrit.wikimedia.org/r/255526 [11:32:45] !log drop local_group_globaldomain_T_mathoid_data and local_group_globaldomain_T_mathoid_request CF stats from graphite [11:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:33:57] (03CR) 10ArielGlenn: [C: 032] increase zmq high water mark for salt master [puppet] - 10https://gerrit.wikimedia.org/r/255526 (owner: 10ArielGlenn) [11:35:54] !log re-enabled puppet on neodymium, config testing done for now [11:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:36:37] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:38:17] PROBLEM - Labs LDAP on seaborgium is CRITICAL: Could not search/find objectclasses in dc=wikimedia,dc=org [11:39:19] 6operations, 7Availability: Set $wmfSwiftCodfwConfig in PrivateSettings - https://phabricator.wikimedia.org/T119651#1833549 (10fgiunchedi) ack, let's coordinate on passing on the credentials [11:40:14] 6operations, 10Wikimedia-General-or-Unknown, 7user-notice: Content of Special:BrokenRedirects and many others query pages not updated since 2015-09-16 - https://phabricator.wikimedia.org/T113721#1833553 (10jcrespo) I've debugged it, and this is not a Database/infrastructure problem: Query starts: ``` ----... 
[11:40:57] ACKNOWLEDGEMENT - Labs LDAP on seaborgium is CRITICAL: Could not search/find objectclasses in dc=wikimedia,dc=org Muehlenhoff In setup [11:41:18] [ec54ff22] 2015-11-26 11:41:09: Fatal exception of type "JobQueueError" [11:41:19] whee [11:42:02] * Reedy files as a task [11:42:38] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1833557 (10ArielGlenn) After applying a patch locally, the above issue was fixed, but there were still 20 to 40 hosts besides the network-unreachable ones that failed to respond, according to t... [11:43:29] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/244748 (https://phabricator.wikimedia.org/T113903) (owner: 10Hashar) [11:43:47] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [11:45:35] (03PS1) 10Filippo Giunchedi: diamond: send log to stdout on INFO [puppet] - 10https://gerrit.wikimedia.org/r/255528 [11:46:30] (03CR) 10Reedy: [C: 031] Adjust connection timeout and maxPartitionsTry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255520 (owner: 10Aaron Schulz) [11:46:36] known ^ (carbon-cache) [11:54:26] !log disabling puppet on neodymium for salt minion tracing [11:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:56:11] 6operations, 10Gitblit: git.wikimedia.org down: 504 Gateway Time-out - https://phabricator.wikimedia.org/T119701#1833595 (10zhuyifei1999) 3NEW [12:05:49] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1833614 (10ArielGlenn) I should add that besides the minion key auth events emitted by the master, there are also two events per test.ping with tags in the old and new (new for this version of...

[12:25:15] 6operations, 7Database: HHVM default configuration kills queries after one minute, disable that - https://phabricator.wikimedia.org/T119704#1833665 (10jcrespo) 3NEW [12:26:46] 6operations, 7Database: HHVM default configuration kills queries after one minute, disable that - https://phabricator.wikimedia.org/T119704#1833680 (10hashar) [12:26:56] 6operations, 7Database: HHVM default configuration kills queries after one minute, disable that - https://phabricator.wikimedia.org/T119704#1833682 (10jcrespo) [12:34:25] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Various comments. Premise looks ok" (0312 comments) [puppet] - 10https://gerrit.wikimedia.org/r/254490 (https://phabricator.wikimedia.org/T110262) (owner: 10MaxSem) [12:38:41] 6operations, 7Database: HHVM default configuration kills queries after one minute, disable that - https://phabricator.wikimedia.org/T119704#1833687 (10hashar) [12:43:47] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [12:47:31] akosiaris, thx for looking at the repl patch! [12:50:24] yurik: yw [12:51:33] akosiaris, max had a lot of problems testing it on labs - apparently that patch is blocked by https://phabricator.wikimedia.org/T119541 [12:51:47] do you know anything about it? [12:52:13] yurik: nope. I know I setup a new self hosted puppetmaster environment just today and did not meet that problem [12:52:34] but yuvi does say it does not show up in all places anyway [12:55:57] akosiaris, so max can test his patch without waiting for this to be resolved? could you leave a note about it to let him know - i think he is waiting for the blocker [12:56:17] maybe, maybe not [12:56:22] depends on the instance as it seems like [12:56:34] not sure what that bug is...
[12:57:27] 7Puppet, 6operations, 6Labs, 5Patch-For-Review: Self hosted puppetmaster is broken - https://phabricator.wikimedia.org/T119541#1833712 (10akosiaris) I setup a new self hosted puppetmaster environment today and I did not meet this problem. [12:59:17] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1833721 (10ArielGlenn) Currently looking at why the minion auths on every request. It should not need to request the new master aes key unless the key has been rotated or the master or minion... [13:15:17] 6operations, 7Database: HHVM default configuration kills queries after one minute, disable that - https://phabricator.wikimedia.org/T119704#1833740 (10jcrespo) [13:24:27] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 48356 bytes in 9.457 second response time [13:25:38] !log stopping opendj on nembus for export of LDAP data [13:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:25:46] 6operations, 6WMDE-Analytics-Engineering, 10Wikidata, 5Patch-For-Review: Push dumps.wm.o logs files to stat1002 - https://phabricator.wikimedia.org/T118739#1833771 (10Addshore) Any progress here? 
[13:27:20] !log restarted opendj on nembus for export of LDAP data [13:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:30:27] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:30:32] 6operations, 7Database: HHVM default configuration kills queries after one minute, disable that - https://phabricator.wikimedia.org/T119704#1833791 (10jcrespo) But we have already: ``` hhvm.mysql.slow_query_threshold = 10000 ``` [13:48:17] PROBLEM - puppet last run on ms-be2004 is CRITICAL: CRITICAL: puppet fail [14:06:58] 6operations, 7Database: HHVM default configuration kills queries after one minute, disable that - https://phabricator.wikimedia.org/T119704#1833847 (10Joe) a:3Joe [14:07:32] (03PS1) 10Giuseppe Lavagetto: terbium: set hhvm mysql read timeout to 1 day [puppet] - 10https://gerrit.wikimedia.org/r/255538 (https://phabricator.wikimedia.org/T119704) [14:07:47] <_joe_> jynus: care to +1 ^^ [14:08:00] <_joe_> a question mark was missing [14:08:21] question mark? [14:08:28] <_joe_> "?" [14:08:37] <_joe_> jynus: care to +1 ^^? [14:08:37] i don't like to be questioned [14:08:48] <_joe_> we never question our Dear Leader [14:08:57] where? [14:09:08] <_joe_> https://gerrit.wikimedia.org/r/255538 [14:09:58] sorry, didn't understand you, was looking for a ternary operator somewhere in the code [14:10:14] (03CR) 10Jcrespo: [C: 032] terbium: set hhvm mysql read timeout to 1 day [puppet] - 10https://gerrit.wikimedia.org/r/255538 (https://phabricator.wikimedia.org/T119704) (owner: 10Giuseppe Lavagetto) [14:10:21] <_joe_> (I was tempted to continue the IRC comedy, but I am a grown up now!) [14:11:13] did you deploy already?
[14:11:45] I can see yes [14:12:24] <_joe_> jynus: I'll try to run the script that was failing [14:12:29] <_joe_> if that's ok with you [14:12:38] screen [14:12:44] <_joe_> ok [14:12:52] some of them may take hours [14:13:22] <_joe_> let me first verify that the fix works even now that it's puppetized [14:13:44] I'm running test.php [14:14:37] <_joe_> thanks for nailing this down, I just have had no time for this :/ [14:14:40] <_joe_> my bad [14:14:45] and now works [14:14:49] <_joe_> :)) [14:14:50] Time: 65 [14:15:09] I actually feel dumb for not knowing that parameter [14:15:21] I suppose you have fought the source code more than me [14:15:22] <_joe_> it's not documented AFAIR [14:15:29] <_joe_> it's in the code, very well hidden [14:15:50] and I was obsessed with the logging of slow queries and the query killer [14:15:55] thinking it was mediawiki [14:16:07] RECOVERY - puppet last run on ms-be2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:16:19] 6operations, 7Database, 5Patch-For-Review: HHVM default configuration kills queries after one minute, disable that - https://phabricator.wikimedia.org/T119704#1833873 (10Joe) 5Open>3Resolved [14:17:10] let me write to users, giving you credit [14:17:42] <_joe_> jynus: what is the broken script that the user was complaining about? [14:18:09] <_joe_> Special:LonelyPages ? 
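The fix _joe_ and jynus verify above amounts to a one-line HHVM ini change (Gerrit 255538, "set hhvm mysql read timeout to 1 day"). A sketch of the relevant settings, assuming the `hhvm.mysql.read_timeout` key (the "very well hidden" undocumented option referenced in the conversation) takes milliseconds; only the `slow_query_threshold` line is quoted directly from T119704:

```ini
; Sketch of /etc/hhvm/php.ini on terbium, not the exact deployed file.
; Already present, and only logs slow queries (>10s) -- not the killer:
hhvm.mysql.slow_query_threshold = 10000
; Assumed form of the actual fix: raise the client-side read timeout
; from the ~60s default to one day so long maintenance queries
; (updateSpecialPages.php etc.) are no longer aborted mid-run.
hhvm.mysql.read_timeout = 86400000
```

This explains the earlier red herrings: the query killer and slow-query logging looked like MediaWiki behavior, but the abort was happening in HHVM's MySQL client layer.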
[14:18:10] looking [14:18:34] 2 users [14:18:38] one Special:DeadendPages [14:18:57] another BrokenRedirects hewiki [14:19:12] I can take care of that, do not worry if you have to do other things [14:19:34] (03PS1) 10Alexandros Kosiaris: tlsproxy::localssl: Force X-Forwarded-For to $remote_addr [puppet] - 10https://gerrit.wikimedia.org/r/255539 [14:19:46] <_joe_> no, this server's setup was my responsibility [14:19:58] yes, and databases are mine :-) [14:30:30] <_joe_> !log run /usr/local/bin/foreachwiki updateSpecialPages.php manually in a screen on terbium [14:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:31:57] (03CR) 10Faidon Liambotis: [C: 04-1] "Please explain why a) this is not a system user (and a system group), b) why a uid needs to be hardcoded and because (a)+(b), c) why is th" [puppet] - 10https://gerrit.wikimedia.org/r/255421 (https://phabricator.wikimedia.org/T119165) (owner: 10Dzahn) [14:44:41] 6operations: Upload new Zuul packages zuul_2.1.0-60-g1cc37f7-wmf3 on apt.wikimedia.org for Precise / Trusty / Jessie - https://phabricator.wikimedia.org/T119714#1833909 (10hashar) p:5High>3Normal [15:04:37] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 48370 bytes in 6.445 second response time [15:10:36] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:20:48] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, 10Traffic: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1833986 (10Stemoc) Apparently its impossible to download the over written version of the image unless you change the pixels and... [15:23:06] (03CR) 10Daniel Kinzler: [C: 04-1] "The config itself seems fine, but doesn't quite do what we want it to do (at least for me locally). 
Blocking this until we figure out what" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254645 (https://phabricator.wikimedia.org/T85368) (owner: 10Bene) [15:24:28] (03PS1) 10Mark Bergsma: Add BGP MED support [debs/pybal] (bgp-med) - 10https://gerrit.wikimedia.org/r/255544 [15:31:07] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: puppet fail [15:46:16] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 48356 bytes in 8.425 second response time [15:47:16] 6operations, 7Graphite: Make it easier to ban misbehaving dashboards from graphite - https://phabricator.wikimedia.org/T119718#1834004 (10fgiunchedi) 3NEW a:3fgiunchedi [15:47:54] 6operations, 7Graphite: Enforce a minimum refresh period for grafana dashboards hitting graphite - https://phabricator.wikimedia.org/T119719#1834012 (10fgiunchedi) 3NEW a:3fgiunchedi [15:48:27] 6operations, 7Graphite: 500 errors from graphite shouldn't be retried by varnish - https://phabricator.wikimedia.org/T119721#1834026 (10fgiunchedi) 3NEW a:3fgiunchedi [15:49:54] !log reenable puppet on graphite1001 [15:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:50:27] PROBLEM - puppet last run on graphite1001 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [15:52:07] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:52:27] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:58:38] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:58:57] !log Deployed patch for T119707 [15:59:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:59:19] 6operations, 7Pybal: Pybal IdleConnectionMonitor with TCP KeepAlive shows random fails if more than 100 servers are involved. - https://phabricator.wikimedia.org/T119372#1834046 (10mark) >>! 
In T119372#1825044, @BBlack wrote: > So, it sounds like apache sends a RST after about 64 seconds? That's probably not... [16:04:53] 6operations, 7Graphite: Make it easier to ban misbehaving dashboards from graphite - https://phabricator.wikimedia.org/T119718#1834055 (10fgiunchedi) the idea being that in the event of a misbehaving/badly configured dashboard it should be quick/simple to blacklist it. we'll have to do this on the graphite si... [16:09:05] 6operations, 7Pybal: Pybal IdleConnectionMonitor with TCP KeepAlive shows random fails if more than 100 servers are involved. - https://phabricator.wikimedia.org/T119372#1834083 (10Joe) @mark we know for a fact it's the keepalive intervening, as if we don't set the keepalive the idleconnection will remain aliv... [16:10:05] 6operations, 7Pybal: Pybal IdleConnectionMonitor with TCP KeepAlive shows random fails if more than 100 servers are involved. - https://phabricator.wikimedia.org/T119372#1834084 (10mark) >>! In T119372#1834083, @Joe wrote: > @mark we know for a fact it's the keepalive intervening, as if we don't set the keepal... 
[16:22:02] !log rebooting lvs3003 with Linux 4.3-1~wmf1 [16:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:22:49] (03PS4) 10Paladox: Show more then 5 commits per repo page [puppet] - 10https://gerrit.wikimedia.org/r/250453 (https://phabricator.wikimedia.org/T117393) [16:23:05] (03PS3) 10Paladox: Re enable tags [puppet] - 10https://gerrit.wikimedia.org/r/250449 [16:24:38] PROBLEM - Host lvs3003 is DOWN: PING CRITICAL - Packet loss = 100% [16:25:29] RECOVERY - Host lvs3003 is UP: PING OK - Packet loss = 0%, RTA = 88.16 ms [16:30:24] !log switch lvs3001 traffic to lvs3003dd [16:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:30:31] s/dd$// :) [16:33:05] I thought we had some new naming scheme [16:35:02] (03PS1) 10Ottomata: Use --until for eventlogging raw vs validated check [puppet] - 10https://gerrit.wikimedia.org/r/255550 (https://phabricator.wikimedia.org/T116035) [16:35:53] (03CR) 10jenkins-bot: [V: 04-1] Use --until for eventlogging raw vs validated check [puppet] - 10https://gerrit.wikimedia.org/r/255550 (https://phabricator.wikimedia.org/T116035) (owner: 10Ottomata) [16:37:23] (03PS2) 10Ottomata: Use --until for eventlogging raw vs validated check [puppet] - 10https://gerrit.wikimedia.org/r/255550 (https://phabricator.wikimedia.org/T116035) [16:38:18] (03CR) 10Joal: [C: 031] "Great, thanks Andrew :)" [puppet] - 10https://gerrit.wikimedia.org/r/255550 (https://phabricator.wikimedia.org/T116035) (owner: 10Ottomata) [16:40:01] (03CR) 10Ottomata: [C: 032] Use --until for eventlogging raw vs validated check [puppet] - 10https://gerrit.wikimedia.org/r/255550 (https://phabricator.wikimedia.org/T116035) (owner: 10Ottomata) [16:41:29] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: puppet fail [16:50:15] (03PS1) 10Bmansurov: Enable RelatedArticles on all wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255553 
(https://phabricator.wikimedia.org/T116676) [16:50:43] !log backfilling user.user_touched from dbstore1002 to sanitarium on all wikis [16:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:51:19] (03CR) 10Bmansurov: [C: 04-1] "SWAT deploy when Cards passes security review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255553 (https://phabricator.wikimedia.org/T116676) (owner: 10Bmansurov) [16:55:50] RECOVERY - pybal on lvs3001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [17:02:59] PROBLEM - puppet last run on mw1039 is CRITICAL: CRITICAL: Puppet has 1 failures [17:12:16] (03PS2) 10Filippo Giunchedi: swiftrepl: move to argparse and ConfigParser [software] - 10https://gerrit.wikimedia.org/r/253311 [17:12:25] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swiftrepl: move to argparse and ConfigParser [software] - 10https://gerrit.wikimedia.org/r/253311 (owner: 10Filippo Giunchedi) [17:12:46] (03PS2) 10Filippo Giunchedi: swiftrepl: add main() [software] - 10https://gerrit.wikimedia.org/r/254144 [17:12:51] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swiftrepl: add main() [software] - 10https://gerrit.wikimedia.org/r/254144 (owner: 10Filippo Giunchedi) [17:13:04] (03PS2) 10Filippo Giunchedi: swiftrepl: add setup.py [software] - 10https://gerrit.wikimedia.org/r/254145 [17:13:10] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swiftrepl: add setup.py [software] - 10https://gerrit.wikimedia.org/r/254145 (owner: 10Filippo Giunchedi) [17:13:21] (03PS2) 10Filippo Giunchedi: swiftrepl: add 'container-set' selection [software] - 10https://gerrit.wikimedia.org/r/254146 [17:13:28] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swiftrepl: add 'container-set' selection [software] - 10https://gerrit.wikimedia.org/r/254146 (owner: 10Filippo Giunchedi) [17:25:57] (03PS1) 10coren: Labs: switch PAM handling to use pam-auth-update [puppet] - 10https://gerrit.wikimedia.org/r/255555 (https://phabricator.wikimedia.org/T85910) 
[17:26:05] paravoid: One you should appreciate ^^ [17:27:44] * Coren goes grab lunch. [17:27:47] RECOVERY - puppet last run on mw1039 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [17:31:25] (03CR) 10Alex Monk: "What's the rationale for this being wikipedia-only?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255553 (https://phabricator.wikimedia.org/T116676) (owner: 10Bmansurov) [17:44:00] (03PS3) 10Muehlenhoff: Uninstall apport [puppet] - 10https://gerrit.wikimedia.org/r/253593 [17:51:48] (03PS2) 10coren: Labs: switch PAM handling to use pam-auth-update [puppet] - 10https://gerrit.wikimedia.org/r/255555 (https://phabricator.wikimedia.org/T85910) [17:52:42] (03CR) 10Muehlenhoff: [C: 032 V: 032] Uninstall apport [puppet] - 10https://gerrit.wikimedia.org/r/253593 (owner: 10Muehlenhoff) [18:21:35] if neodymium whines that there's no salt minion, that's me, please ignore [18:22:22] (03PS1) 10Joal: Grow EventLogging graphite check lag 10 minutes [puppet] - 10https://gerrit.wikimedia.org/r/255556 [18:24:11] (03PS2) 10Joal: Grow EventLogging graphite check lag 10 minutes [puppet] - 10https://gerrit.wikimedia.org/r/255556 [18:24:35] apergos: so, I know that you have been discussing this with other opsen and documenting it, but I have not always been able to keep up. What is the TL;DR on Salt jobs not reaching all minions, or on job status not always propagating back to the salt master? Is there a way to solve that once and for all, or does Salt mandate that we be comfortable with a margin of error? [18:25:17] PROBLEM - salt-minion processes on neodymium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [18:25:24] and there's the whine, please ignore, thanks [18:26:58] ori: I wish I had a sound-byte answer [18:27:13] somebody is really flooding wikidata query service with requests... is it possible to temporarily block an IP from query.wikidata.org? 
theoretically you can never be 100% sure because of the pub sub model. but. [18:28:07] (03CR) 10Ottomata: [C: 032] Grow EventLogging graphite check lag 10 minutes [puppet] - 10https://gerrit.wikimedia.org/r/255556 (owner: 10Joal) [18:28:10] in practice yes it's fixable with reasonable allocation of resources [18:28:35] how far away are we from having those? [18:29:12] thanks for working on it btw; my preference was that we switch to another stack, but improving salt is certainly better than living with the former status quo [18:29:20] well some of those allocation of resources are "finish backporting this patch and if it works on one minion then try it carefully on others" [18:29:26] nod [18:29:34] which a resource use reduction sort of thing [18:29:40] *which is [18:30:15] once that is in place we should see... unless I'm really off base here... good performance on test.ping across the cluster unbatched [18:30:21] and with normal client response [18:30:24] SMalyshev: Is it behind misc. varnish? [18:30:33] hoo: yes [18:30:43] yippie [18:30:43] that will mean that all targeted commands will work properly, since that's what they use too [18:30:44] bblack: ^ in that case [18:31:10] cool [18:31:19] I'm reserving the right to add on extra time to that though, in case one of the issues I see doesn't go away with this patch [18:34:07] * ori nods [18:38:40] RECOVERY - salt-minion processes on neodymium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:38:45] anybody to help with varnish/query service stuff? [18:41:38] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 48356 bytes in 2.675 second response time [18:47:28] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:55:11] apergos: ping?
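The reliability work apergos describes above is partly master-side tuning (cf. the "increase zmq high water mark for salt master" patch merged at 11:33). A sketch of the kind of /etc/salt/master settings involved; the option names are real Salt master options, but the values here are illustrative, not the deployed configuration:

```yaml
# /etc/salt/master (sketch; values are illustrative, not WMF's)
# Raise the ZeroMQ publisher high water mark so published jobs aren't
# silently dropped when many minions drain the pub socket slowly --
# the pub/sub model is why delivery can never be 100% guaranteed.
pub_hwm: 10000
# More worker threads to absorb bursts of minion returns and
# (re-)auth events on a large cluster.
worker_threads: 16
# Wait longer when gathering job returns, so "minion did not respond"
# is less often just a timeout artifact on slow hosts.
gather_job_timeout: 30
```

With headroom like this in place, targeted commands and unbatched `test.ping` across the cluster should behave as apergos predicts, since both go through the same pub/sub path.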
[19:13:08] (03CR) 10Faidon Liambotis: [C: 04-1] Labs: switch PAM handling to use pam-auth-update (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/255555 (https://phabricator.wikimedia.org/T85910) (owner: 10coren) [19:52:49] (03CR) 10coren: "Comments inline." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/255555 (https://phabricator.wikimedia.org/T85910) (owner: 10coren) [19:55:49] (03PS10) 10Ottomata: Add format.topic configuration [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/230173 (https://phabricator.wikimedia.org/T108379) [19:58:26] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1834392 (10ArielGlenn) Fighting with a backport of the conversion of SAuth to a singleton, from fb747fa of the development branch, Tedious. [20:04:20] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service: DoS on wikidata query service - https://phabricator.wikimedia.org/T119737#1834394 (10Smalyshev) 3NEW [20:04:44] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service, 10hardware-requests: Additional diskspace of wdqs1001/wdqs1002 - https://phabricator.wikimedia.org/T119579#1834402 (10Smalyshev) [20:06:37] (03PS11) 10Ottomata: Add format.topic configuration [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/230173 (https://phabricator.wikimedia.org/T108379) [20:55:47] (03CR) 10Aaron Schulz: [C: 032] Adjust connection timeout and maxPartitionsTry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255520 (owner: 10Aaron Schulz) [20:56:26] (03Merged) 10jenkins-bot: Adjust connection timeout and maxPartitionsTry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255520 (owner: 10Aaron Schulz) [20:59:29] !log aaron@tin Synchronized wmf-config/jobqueue-eqiad.php: Adjust connection timeout and maxPartitionsTry (duration: 00m 34s) [20:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:13:35] Reedy: 
https://logstash.wikimedia.org/#dashboard/temp/AVFFpKXMptxhN1XaqkW3 looking better [21:13:45] :) [21:25:00] 6operations: Investigate redis connections errors on rdb100[13] - https://phabricator.wikimedia.org/T119739#1834444 (10aaron) [21:25:04] 6operations: Investigate redis connections errors on rdb100[12] - https://phabricator.wikimedia.org/T119739#1834437 (10aaron) 3NEW [21:28:55] (03CR) 10coren: "(Further comment inline)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/255555 (https://phabricator.wikimedia.org/T85910) (owner: 10coren) [21:37:35] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service: DoS on wikidata query service - https://phabricator.wikimedia.org/T119737#1834457 (10Smalyshev) OK, looks like it stopped, so I'll create separate ticket for per-IP limits. Looking at https://grafana.wikimedia.org/dashboard/db/wikidata-query-se... [21:37:43] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service: DoS on wikidata query service - https://phabricator.wikimedia.org/T119737#1834458 (10Smalyshev) 5Open>3Resolved [22:11:17] (03PS1) 10Yuvipanda: Revert "Revert "k8s: Add ServiceAccount and ResourceQuota admission controllers"" [puppet] - 10https://gerrit.wikimedia.org/r/255653 [22:19:46] 6operations: Upload new Zuul packages zuul_2.1.0-60-g1cc37f7-wmf3 on apt.wikimedia.org for Precise / Trusty / Jessie - https://phabricator.wikimedia.org/T119714#1834490 (10hashar) [22:20:18] 6operations: Upload new Zuul packages zuul_2.1.0-60-g1cc37f7-wmf3 on apt.wikimedia.org for Precise / Trusty / Jessie - https://phabricator.wikimedia.org/T119714#1833909 (10hashar) zuul_2.1.0-60-g1cc37f7-wmf3 is bugged (T119741). Need a new version. 
[22:36:08] 6operations: Upload new Zuul packages zuul_2.1.0-60-g1cc37f7-wmf3 on apt.wikimedia.org for Precise / Trusty / Jessie - https://phabricator.wikimedia.org/T119714#1834527 (10hashar) [22:37:00] 6operations: Upload new Zuul packages zuul_2.1.0-60-g1cc37f7-wmf4 on apt.wikimedia.org for Precise / Trusty / Jessie - https://phabricator.wikimedia.org/T119714#1834528 (10hashar) [22:37:27] 6operations: Upload new Zuul packages zuul_2.1.0-60-g1cc37f7-wmf4 on apt.wikimedia.org for Precise / Trusty / Jessie - https://phabricator.wikimedia.org/T119714#1833909 (10hashar) I have built a new Precise version. Have to do the building dance for Trusty and Jessie now. [23:08:49] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [100000000.0] [23:35:58] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]