[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161130T0000). Please do the needful.
[00:00:04] kaldari: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:19] yeehaw! here we go!
[00:00:47] andre__: I'm going to remove HTTPS from T149977 since I don't think is relevant and adds traffic + operations, thoughts?
[00:00:48] T149977: After login, user not logged in when "prefershttps" set to false and "wgSecureLogin" set to true - https://phabricator.wikimedia.org/T149977
[00:01:09] or not
[00:01:12] godog, do it? :)
[00:01:21] I guess the train deployment didn't happen?
[00:01:27] My patch depends on the train
[00:02:15] jouncebot: now
[00:02:15] For the next 0 hour(s) and 57 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161130T0000)
[00:02:32] andre__: heh, I thought #HTTPS might have had some hidden meaning I wasn't aware of
[00:02:50] godog: I won't know more than what its project description says, sorry :P
[00:03:03] (PS2) Dzahn: Phabricator: Don't use vcs group, use phd [puppet] - https://gerrit.wikimedia.org/r/323996 (https://phabricator.wikimedia.org/T146055) (owner: 20after4)
[00:03:13] hehe fair
[00:05:05] greg-g: Should I reschedule my SWAT patch for tomorrow? It depends on the train deployment.
[00:05:10] (CR) Dzahn: [C: 2] Phabricator: Don't use vcs group, use phd [puppet] - https://gerrit.wikimedia.org/r/323996 (https://phabricator.wikimedia.org/T146055) (owner: 20after4)
[00:09:18] (CR) Dzahn: "Notice: /Stage[main]/Phabricator/Phabricator::Conf_env[vcs]/File[/srv/phab/phabricator/conf/local/vcs.json]/group: group changed 'vcs' to " [puppet] - https://gerrit.wikimedia.org/r/323996 (https://phabricator.wikimedia.org/T146055) (owner: 20after4)
[00:10:19] Operations, Discovery, Maps: Investigate Swift as a storage backend for maps tiles - https://phabricator.wikimedia.org/T149885#2833686 (fgiunchedi) p:Triage>Normal >>! In T149885#2768978, @MaxSem wrote: > Now that we know our space requirements are still low, we can investigate our options fu...
[00:10:39] _joe_: Any update on things? It's currently SWAT deployment time, but looks like the train was blocked by T151702. Should I reschedule SWAT patches for tomorrow instead?
[00:10:40] T151702: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702
[00:12:37] kaldari: group0 is at wmf.4
[00:12:47] oh cool
[00:13:00] https://tools.wmflabs.org/versions/
[00:13:28] there was a little hiccup but o.striches handled it
[00:13:59] bd808: that's a handy tool
[00:14:35] it is indeed. some master craftsman must have made it ;)
[00:14:51] Operations, Deployment-Systems, Release-Engineering-Team: Trebuchet targets for test/testrepo are out of date - https://phabricator.wikimedia.org/T149180#2833693 (fgiunchedi) p:Triage>Low
[00:14:54] no doubt :)
[00:15:27] if you click on the version numbers it will show you the wikis in that group too
[00:15:38] Operations, Wikimedia-Logstash: fix partition scheme for logstash ingester hosts - https://phabricator.wikimedia.org/T150108#2833694 (fgiunchedi) p:Triage>Normal
[00:15:40] nice
[00:16:17] bd808: Where is everybody? This channel is a ghosttown today.
[00:16:56] nice color palette too
[00:17:30] kaldari: *shrug* making charitable donations?
[00:17:38] no doubt
[00:18:40] (PS2) Kaldari: Test cookie blocking on Test Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/322154 (https://phabricator.wikimedia.org/T150991)
[00:26:11] (PS1) Filippo Giunchedi: logstash: switch to /srv partitioning for ingester hosts [puppet] - https://gerrit.wikimedia.org/r/324362 (https://phabricator.wikimedia.org/T150108)
[00:26:16] bd808: how much of a problem it is ATM to reimage logstash ingester hosts? for ^
[00:26:32] IIRC it wasn't behind pybal yet ?
[00:26:52] no, and it probably won't be
[00:27:22] several of the protocols we use are udp based and use multiple packets per message
[00:27:41] so all the UDP needs to go to the same host
[00:27:52] for a given protocol
[00:27:58] anyone object if I deploy my config change on test Wikipedia (https://gerrit.wikimedia.org/r/#/c/322154)? Looks like no one's doing SWAT deployments today.
[00:28:16] kaldari: go for it
[00:28:19] plus I haven't broken the sites in a while
[00:28:47] (CR) Kaldari: [C: 2] Test cookie blocking on Test Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/322154 (https://phabricator.wikimedia.org/T150991) (owner: Kaldari)
[00:28:53] godog: we can tweak the mediawiki config pretty easily to migrate traffic to a subset of hosts
[00:29:07] bd808: ah, lvs can do source hashing though for that
[00:29:20] (Merged) jenkins-bot: Test cookie blocking on Test Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/322154 (https://phabricator.wikimedia.org/T150991) (owner: Kaldari)
[00:29:24] but hhvm and other log sources are basically hard coded to particular backends
[00:29:36] godog: oh? somebody should test that out then :)
[00:30:06] bd808: We're still using tin to do all the syncing from, right?
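(Editor's sketch) The source-hashing idea raised at 00:29:07 can be illustrated roughly as follows: a hash of the sender's address selects the backend, so every datagram of a multi-packet UDP message from one sender reaches the same ingester. This is a toy illustration, not how LVS implements its scheduler; the backend names and the `pick_backend` helper are hypothetical.

```python
import hashlib

# Hypothetical backend names, used only for illustration.
BACKENDS = ["ingester-a", "ingester-b", "ingester-c"]


def pick_backend(source_ip, backends=BACKENDS):
    """Deterministically map a source address to one backend.

    The same source address always yields the same backend, which is the
    property needed when one log message spans several UDP packets.
    """
    digest = hashlib.sha256(source_ip.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]
```

Note that a plain modulo mapping like this reshuffles most senders whenever the backend list changes; that trade-off is outside what the log discusses.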
[00:30:22] just want to double check :)
[00:30:22] godog: we can't play with lvs in beta cluster due to the OpenStack network layer so changes like that are blind to me
[00:30:41] kaldari: yeah. and there is a new step to try things out before breaking the whole cluster
[00:30:47] kaldari: let me find the link
[00:32:16] kaldari: https://wikitech.wikimedia.org/wiki/SWAT_deploys#Doing_the_deploy
[00:32:30] oops mw1099 needs to be changed there
[00:33:16] the test host is mwdebug1002 now
[00:33:32] thanks
[00:33:56] bd808: indeed, we could add another service for logstash ingestion to the existing logstash.svc
[00:34:04] so fetch on tin like always, scap pull on mwdebug1002, test with X-Wikimedia-Debug, scap sync-file or whatever
[00:34:51] godog: if you can get things moved to have a balancer in front of the logstash service that would be awesome
[00:35:17] right now there are some things pinned to each of the 3 physical hosts
[00:35:55] hhvm syslog to 01, restbase to 03, *something* to 02
[00:36:11] mediawiki is the only thing that spreads out over all 3
[00:36:34] Operations, Wikimedia-Logstash: Move logstash ingestion behind LVS - https://phabricator.wikimedia.org/T151971#2833764 (fgiunchedi)
[00:36:47] bd808: cded to /srv/mediawiki-staging, did git fetch, but git diff wmf-config shows nothing. Am I missing a step?
[00:36:50] Operations, Wikimedia-Logstash: Move logstash ingestion behind LVS - https://phabricator.wikimedia.org/T151971#2833776 (fgiunchedi) p:Triage>Normal
[00:37:02] change is merged: https://gerrit.wikimedia.org/r/#/c/322154/
[00:37:40] bd808: yeah I can get things lined up over time but not deployed until after the holidays
[00:38:00] kaldari: git log --stat HEAD..@{upstream} will show you what is fetched but not staged in the index
[00:38:23] kaldari: then git rebase @{upstream} to actually apply the pending changes
[00:38:37] godog: *nod*
[00:38:40] "git diff HEAD origin" show it
[00:38:43] shows it
[00:39:01] Operations, Wikimedia-Logstash, Patch-For-Review: fix partition scheme for logstash ingester hosts - https://phabricator.wikimedia.org/T150108#2833778 (fgiunchedi)
[00:39:03] Operations, Wikimedia-Logstash: Move logstash ingestion behind LVS - https://phabricator.wikimedia.org/T151971#2833777 (fgiunchedi)
[00:39:17] kaldari: yup that's another random way to look at the diff
[00:39:35] so then rebase to get it applied and your ready to sync things
[00:40:06] that works
[00:40:27] !log phab2001 - enabled puppet to bring it up2date with a various changes
[00:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:40:38] godog: back to your original question, it will cause us to lose some logs but its not the end of the world
[00:41:17] godog: would it be a full reimage of the hosts?
[00:41:36] * bd808 probably has things in ~ that aren't backed up
[00:41:50] bd808: permission denied for ssh mwdebug1002
[00:42:40] kaldari: from your laptop?
[00:42:59] from tin
[00:43:15] I'll try from laptop
[00:43:26] If you want to jump over from tin you can do `SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@mwdebug1002.eqiad.wmnet`
[00:43:53] cool, thanks!
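(Editor's sketch) The git steps bd808 describes between 00:38:00 and 00:39:35 (fetch, inspect what was fetched, then rebase onto the upstream branch) could be collected as below. The helper only builds the command lines and prints them; it does not run anything. The function name `staging_commands` is hypothetical, the default path is the one kaldari mentions, and the scap steps are left out because the log does not spell out their exact arguments.

```python
import shlex


def staging_commands(path="/srv/mediawiki-staging"):
    """Return the git commands from the log, in the order they were suggested."""
    return [
        # Fetch the merged change into the staging checkout.
        ["git", "-C", path, "fetch"],
        # Show what has been fetched but is not yet applied locally.
        ["git", "-C", path, "log", "--stat", "HEAD..@{upstream}"],
        # Apply the pending upstream changes.
        ["git", "-C", path, "rebase", "@{upstream}"],
    ]


if __name__ == "__main__":
    # Print shell-quoted command lines for review instead of executing them.
    for command in staging_commands():
        print(shlex.join(command))
```

Printing rather than executing keeps the sketch safe to try anywhere; anyone adapting it would pass each list to `subprocess.run` on the deployment host.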
[00:44:25] the prod bastions don't allow agent forwarding anymore so you have to either come in from the outside to each host or cheat and use the scap ssh agent
[00:45:28] synced on 1002, testing...
[00:46:04] bd808: oh yeah, I guess agent forwarding wasn't a good idea :)
[00:49:18] bd808: yeah, but not urgent at all, just OCD
[00:50:08] PROBLEM - HHVM processes on mw1276 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm
[00:50:08] PROBLEM - HHVM rendering on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50371 bytes in 0.002 second response time
[00:50:11] the most useful thing on those boxes is my ~/.bash_history so I don't have to remember how to do things; just grep
[00:51:08] RECOVERY - HHVM processes on mw1276 is OK: PROCS OK: 6 processes with command name hhvm
[00:51:08] RECOVERY - HHVM rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 71534 bytes in 0.131 second response time
[00:52:53] !log on mw1276: tuning jemalloc, will restart hhvm several times, running it in a terminal
[00:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:53:16] !log kaldari@tin Synchronized wmf-config/InitialiseSettings.php: sync InitialiseSettings to test cookie blocking on Test Wikipedia (duration: 00m 45s)
[00:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:57:26] RECOVERY - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is OK: TCP OK - 0.036 second response time on 10.192.48.69 port 9042
[00:59:59] Operations, MediaWiki-Configuration, Performance-Team, Services (watching), and 5 others: Integrating MediaWiki (and other services) with dynamic configuration - https://phabricator.wikimedia.org/T149617#2833815 (Krinkle) >>! In T149617#2832982, @aaron wrote: > The background process would write...
[01:00:41] (PS1) Dzahn: phab: fix systemd unit file name of ssh-phab [puppet] - https://gerrit.wikimedia.org/r/324369
[01:01:36] PROBLEM - PHD should be supervising processes on phab2001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 997 (phd)
[01:02:46] PROBLEM - puppet last run on phab2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 49 seconds ago with 1 failures. Failed resources (up to 3 shown): File[/etc/systemd/system/ssh-phab.service]
[01:03:24] hmm why is phab2001 alerting
[01:03:56] PROBLEM - Check systemd state on mw1276 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:04:06] twentyafterfour: that's me
[01:04:14] twentyafterfour: i was talking in releng
[01:04:22] (CR) 20after4: [C: 1] phab: fix systemd unit file name of ssh-phab [puppet] - https://gerrit.wikimedia.org/r/324369 (owner: Dzahn)
[01:04:30] :) yes, and that should fix it
[01:04:32] thanks
[01:04:36] mutante: cool
[01:04:56] also, puppet ran on that and it got a bunch of other updates
[01:05:05] confirmed that it didnt break git-ssh
[01:05:08] cool
[01:05:54] we should be very nearly ready to make phab2001 be a real hot spare for repositories and warm backup for web
[01:06:24] twentyafterfour it seems phab ssh service is failing puppet
[01:06:33] Error: /Stage[main]/Phabricator::Vcs/File[/etc/systemd/system/ssh-phab.service]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/phabricator/sshd-phab.service
[01:06:56] paladox: I think mutante just fixed that with https://gerrit.wikimedia.org/r/#/c/324369/1
[01:07:08] oh
[01:07:10] thanks
[01:08:06] (CR) Paladox: [C: 1] "Good notice, I just noticed it just now :)" [puppet] - https://gerrit.wikimedia.org/r/324369 (owner: Dzahn)
[01:08:38] (PS1) Filippo Giunchedi: lvs: add logstash [puppet] - https://gerrit.wikimedia.org/r/324371 (https://phabricator.wikimedia.org/T151971)
[01:10:33] (PS2) Dzahn:
phab: fix systemd unit file name of ssh-phab [puppet] - https://gerrit.wikimedia.org/r/324369
[01:11:44] (PS3) Dzahn: phab: fix systemd unit file name of ssh-phab [puppet] - https://gerrit.wikimedia.org/r/324369 (https://phabricator.wikimedia.org/T137928)
[01:11:50] (PS4) Dzahn: phab: fix systemd unit file name of ssh-phab [puppet] - https://gerrit.wikimedia.org/r/324369 (https://phabricator.wikimedia.org/T137928)
[01:13:08] (CR) Dzahn: [C: 2] phab: fix systemd unit file name of ssh-phab [puppet] - https://gerrit.wikimedia.org/r/324369 (https://phabricator.wikimedia.org/T137928) (owner: Dzahn)
[01:13:55] (PS1) Filippo Giunchedi: templates: add PTR for pdfrender [dns] - https://gerrit.wikimedia.org/r/324372
[01:13:57] (PS1) Filippo Giunchedi: templates: add logstash.svc [dns] - https://gerrit.wikimedia.org/r/324373 (https://phabricator.wikimedia.org/T151971)
[01:14:13] Operations, Parsoid, Release-Engineering-Team: Provide a /parsoid directory on releases.wikimedia.org - https://phabricator.wikimedia.org/T150672#2833841 (fgiunchedi) p:Triage>Normal
[01:14:36] mutante twentyafterfour phabricator now works on labs
[01:14:40] no puppet errors
[01:14:43] :)
[01:14:46] RECOVERY - puppet last run on phab2001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[01:15:23] Operations: Upgrade qemu on ganeti clusters to 2.7 - https://phabricator.wikimedia.org/T150532#2833846 (fgiunchedi) p:Triage>Normal
[01:15:26] paladox: :) yay
[01:15:34] Operations: Puppet CA rollover - https://phabricator.wikimedia.org/T150823#2833847 (fgiunchedi) p:Triage>Normal
[01:15:47] Yep, we can finally move away from the phabricator labs class
[01:16:29] * mutante awards a token
[01:16:50] LOL, :):):):):):):)
[01:17:06] PROBLEM - Check systemd state on phab2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:17:12] Operations, DC-Ops: Racktables equipment that should probably be renamed ? - https://phabricator.wikimedia.org/T150744#2833860 (fgiunchedi) p:Triage>Normal
[01:17:27] mutante twentyafterfour only thing left is to make the domain configurable in the apache file
[01:17:34] but i have to go now
[01:19:13] paladox: great, continue tomorrow please, thanks
[01:19:21] cu later
[01:19:22] Ok
[01:19:29] and you too :)
[01:19:56] RECOVERY - Check systemd state on mw1276 is OK: OK - running: The system is fully operational
[01:20:07] we can move on with rest of T137928 now i think
[01:20:08] T137928: Deploy phabricator to phab2001.codfw.wmnet - https://phabricator.wikimedia.org/T137928
[01:20:19] ok
[01:20:20] :)
[01:20:22] since the networking stuff is unblocked
[01:20:25] or should be
[01:20:26] yep
[01:28:52] ACKNOWLEDGEMENT - Check systemd state on phab2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T137928
[01:28:52] ACKNOWLEDGEMENT - PHD should be supervising processes on phab2001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 997 (phd) daniel_zahn https://phabricator.wikimedia.org/T137928
[01:55:06] Operations, Mail: Update legal-tm-vio@ alias - https://phabricator.wikimedia.org/T150463#2833914 (fgiunchedi) Open>Resolved p:Triage>Normal a:fgiunchedi @Slaporte I've added both to `legal-tm-vio@` now, note that `trademark@` is a group so if recipients are in both they would get mail...
[01:56:39] Operations, puppet-compiler, User-Joe: puppet compiler fails with modules using puppetdb - https://phabricator.wikimedia.org/T150456#2833919 (fgiunchedi) p:Triage>Normal
[01:59:20] Operations, Parsing-Team, Release-Engineering-Team, HHVM, and 2 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2833920 (tstarling) >>!
In T151702#2831448, @Joe wrote: > From a quick look, most threads seem effectively blocked in a very simple function: > > ``` > je...
[02:07:35] Operations, Phabricator: iridium / filesystem almost full - https://phabricator.wikimedia.org/T150396#2833935 (fgiunchedi) p:Triage>Normal `/tmp` keeps getting full with temporary directories that are never cleaned up. Interestingly all files in there are either one byte or 4194304 bytes so some...
[02:07:35] !log Updated Wikidata's property suggester with data from Monday's json dump and applied the T132839 workarounds
[02:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:07:47] T132839: [RfC] Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839
[02:08:16] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:08:16] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:08:16] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:09:06] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave
[02:09:06] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[02:09:06] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[02:10:37] Operations, Discovery, Wikidata, Wikidata-Query-Service: Wikidata Query Service is overly verbose toward logstash - https://phabricator.wikimedia.org/T150356#2833943 (fgiunchedi) p:Triage>Normal
[02:10:48] Operations, Security-Team: icinga notification if elevated writing to badpass.log - https://phabricator.wikimedia.org/T150300#2833944 (fgiunchedi) p:Triage>Normal
[02:10:57] Operations, Analytics: Install java 8 to stat1002 - https://phabricator.wikimedia.org/T151896#2833945 (fgiunchedi) p:Triage>Normal
[02:15:04] Operations, Analytics: Install java 8 to stat1002 - https://phabricator.wikimedia.org/T151896#2833946 (fgiunchedi) IIRC the alternatives should already prefer java-7 if both are installed, I'm not sure to which puppet class/role to add the package though (cc @Ottomata @elukey )
[02:15:17] Operations, Electron-PDFs, TCB-Team, Patch-For-Review, User-notice: Deploy ElectronPdfService Extension to production - https://phabricator.wikimedia.org/T150185#2833948 (fgiunchedi) p:Triage>Normal
[02:29:56] Operations, Wikimedia-Extension-setup, I18n: Deploy IDS rendering engine to production - https://phabricator.wikimedia.org/T148693#2730181 (MaxSem) Uh, this renderer has not only Chinese documentation and comments, but even identifiers are in Chinese in some places. To me, this means that (almost?) n...
[02:33:19] Operations: reinstall rcs100[12] with RAID - https://phabricator.wikimedia.org/T140441#2464918 (fgiunchedi) Looks like both machine might actually have only one disk.
Both machines are out of warranty since 2014, we can probably move at least one to a VM
[03:24:56] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 738.75 seconds
[03:47:56] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 292.72 seconds
[04:05:46] PROBLEM - puppet last run on cp1067 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:34:46] RECOVERY - puppet last run on cp1067 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:45:26] PROBLEM - puppet last run on db1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:55:06] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[04:56:06] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4755300 keys, up 29 days 20 hours - replication_delay is 49
[04:59:06] PROBLEM - puppet last run on elastic1042 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:13:26] RECOVERY - puppet last run on db1033 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[05:13:33] Operations, Parsing-Team, Release-Engineering-Team, HHVM, and 2 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2834167 (tstarling) Filed upstream bug https://github.com/facebook/hhvm/issues/7515 , but we're not blocked on it, we can use the MALLOC_CONF environment v...
[05:26:06] RECOVERY - puppet last run on elastic1042 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[05:40:42] (CR) Krinkle: [C: -1] "Per IRC. docroot/mobileportal is symlinked to m.wikipedia.org and mobilelanding.php is used in various places."
[mediawiki-config] - https://gerrit.wikimedia.org/r/323999 (owner: Chad)
[05:41:33] (Abandoned) Krinkle: Remove bits.wikimedia.org apache config [puppet] - https://gerrit.wikimedia.org/r/322420 (https://phabricator.wikimedia.org/T107430) (owner: Alex Monk)
[05:42:08] (PS3) Krinkle: Remove bits docroot [mediawiki-config] - https://gerrit.wikimedia.org/r/317657 (owner: Chad)
[05:49:26] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=986.80 Read Requests/Sec=356.00 Write Requests/Sec=3.50 KBytes Read/Sec=44060.00 KBytes_Written/Sec=90.40
[05:57:26] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=10.80 Read Requests/Sec=10.60 Write Requests/Sec=225.40 KBytes Read/Sec=70.80 KBytes_Written/Sec=2885.60
[06:27:06] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:35:06] PROBLEM - puppet last run on ocg1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tree]
[06:37:12] RIP, bits.
[06:45:26] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK
[06:53:11] (PS1) Yuvipanda: statistics: use R from jessie-backports on jessie boxes [puppet] - https://gerrit.wikimedia.org/r/324384
[06:56:06] RECOVERY - puppet last run on db1046 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[07:00:14] (PS2) Yuvipanda: statistics: use R from jessie-backports on jessie boxes [puppet] - https://gerrit.wikimedia.org/r/324384
[07:02:06] RECOVERY - puppet last run on ocg1003 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[07:03:07] (CR) Yuvipanda: [C: 2] statistics: use R from jessie-backports on jessie boxes [puppet] - https://gerrit.wikimedia.org/r/324384 (owner: Yuvipanda)
[07:06:14] !log Stop mysql db2048 maintenance - T149553
[07:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:26] T149553: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553
[07:17:32] !log Deploy alter table dbstore1002 - dewiki.revision - T148967
[07:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:44] T148967: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967
[08:01:45] Operations, ops-eqiad, hardware-requests: Return wmf4747/wmf4748/wmf4749/wmf4750 to spares - https://phabricator.wikimedia.org/T146171#2834278 (Joe) I'm not clear if there is anything I should do about this ticket
[08:42:12] (PS1) Jcrespo: mariadb: change check_private_data to print DROP statements [puppet] - https://gerrit.wikimedia.org/r/324386 (https://phabricator.wikimedia.org/T147052)
[08:42:28] (PS2) Jcrespo: mariadb: change check_private_data to print DROP statements [puppet] - https://gerrit.wikimedia.org/r/324386 (https://phabricator.wikimedia.org/T147052)
[08:43:43] (CR) Marostegui: "nice change, a lot easier to handle the future drops directly from the output!"
[puppet] - https://gerrit.wikimedia.org/r/324386 (https://phabricator.wikimedia.org/T147052) (owner: Jcrespo)
[08:43:52] (CR) Marostegui: [C: 1] mariadb: change check_private_data to print DROP statements [puppet] - https://gerrit.wikimedia.org/r/324386 (https://phabricator.wikimedia.org/T147052) (owner: Jcrespo)
[08:44:09] (CR) Jcrespo: [C: 2] mariadb: change check_private_data to print DROP statements [puppet] - https://gerrit.wikimedia.org/r/324386 (https://phabricator.wikimedia.org/T147052) (owner: Jcrespo)
[08:44:31] <_joe_> !log stopped dedicated commonswiki jobrunner T151196
[08:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:43] T151196: Job queue size growing since ~12:00 on 2016-11-19 - https://phabricator.wikimedia.org/T151196
[08:51:40] (PS2) Volans: RAID: reduce MegaCLI sensibility (physical disks) [puppet] - https://gerrit.wikimedia.org/r/324240 (https://phabricator.wikimedia.org/T151043)
[09:04:24] Operations, Wikimedia-General-or-Unknown, Availability, Patch-For-Review, and 2 others: Job queue size growing since ~12:00 on 2016-11-19 - https://phabricator.wikimedia.org/T151196#2834319 (Ankry) >>! In T151196#2823445, @matmarex wrote: > Until the normal job processing is fixed to cope with th...
[09:07:46] (PS1) Aaron Schulz: Bump $wgJobBackoffThrottling for cache purges [mediawiki-config] - https://gerrit.wikimedia.org/r/324388
[09:13:19] Operations, OCG-General: ocg alarm ocg_job_status_queue 'flapping' - https://phabricator.wikimedia.org/T97524#1245233 (Volans) The alarm is on again since 3 days on Icinga, and looking at the last 6 months trend it seems that the alarm might need some re-tuning if the trend is legitimate and not an indic...
[09:28:16] (PS6) Elukey: Refactor the parsing functions out of the main C file [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/322257 (https://phabricator.wikimedia.org/T147440)
[09:37:56] Operations, Parsoid, Release-Engineering-Team: Provide a /parsoid directory on releases.wikimedia.org - https://phabricator.wikimedia.org/T150672#2792988 (Legoktm) A new directory can be created by defining it in puppet: https://github.com/wikimedia/operations-puppet/blob/production/modules/releases/...
[09:38:42] Operations, Traffic: several 502 Bad Gateway - https://phabricator.wikimedia.org/T151686#2824625 (ema) @doctaxon still happening?
[09:42:11] (PS1) Jcrespo: mariadb: fix bugs with check_private_data regarding DROP and NULL [puppet] - https://gerrit.wikimedia.org/r/324390 (https://phabricator.wikimedia.org/T147052)
[09:42:52] !log Stop mysql on db2048 for maintenance - https://phabricator.wikimedia.org/T149553
[09:43:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:11] Operations, Traffic, Patch-For-Review: Huge increase in cache_upload 404s due to buggy client-side code from graphiq.com - https://phabricator.wikimedia.org/T151444#2834370 (ema) p:Normal>Low The amount of upload 404s decreased significantly, we're almost back to normal: https://grafana.wikim...
[09:46:41] (CR) Zfilipin: "I cherry picked it at the CI puppet master. Will puppet run automatically, or should I run it manually?"
[puppet] - https://gerrit.wikimedia.org/r/324203 (https://phabricator.wikimedia.org/T117418) (owner: Zfilipin)
[09:47:26] (CR) Marostegui: [C: 1] mariadb: fix bugs with check_private_data regarding DROP and NULL [puppet] - https://gerrit.wikimedia.org/r/324390 (https://phabricator.wikimedia.org/T147052) (owner: Jcrespo)
[09:48:20] (CR) Jcrespo: [C: 2] mariadb: fix bugs with check_private_data regarding DROP and NULL [puppet] - https://gerrit.wikimedia.org/r/324390 (https://phabricator.wikimedia.org/T147052) (owner: Jcrespo)
[10:15:26] (PS1) Giuseppe Lavagetto: mediawiki::hhvm: allow to override the default jemalloc arenas [puppet] - https://gerrit.wikimedia.org/r/324394 (https://phabricator.wikimedia.org/T151702)
[10:16:37] (PS1) ArielGlenn: miscdumps: make refresh interval for lock a few seconds shorter than stale time [dumps] - https://gerrit.wikimedia.org/r/324395
[10:17:19] (CR) ArielGlenn: [C: 2] miscdumps: make refresh interval for lock a few seconds shorter than stale time [dumps] - https://gerrit.wikimedia.org/r/324395 (owner: ArielGlenn)
[10:22:01] Operations, netops: Thorium (new stat1001) needs to communicate with the Analytics VLAN - https://phabricator.wikimedia.org/T151990#2834399 (elukey)
[10:22:16] Operations, netops: Thorium (new stat1001) needs to communicate with the Analytics VLAN - https://phabricator.wikimedia.org/T151990#2834411 (elukey) p:Triage>Normal
[10:25:31] (PS2) Giuseppe Lavagetto: mediawiki::hhvm: allow to override the default jemalloc arenas [puppet] - https://gerrit.wikimedia.org/r/324394 (https://phabricator.wikimedia.org/T151702)
[10:33:44] Operations, netops: Thorium (new stat1001) needs to communicate with the Analytics VLAN - https://phabricator.wikimedia.org/T151990#2834422 (elukey) Had a chat with Alex on IRC about what stat1001 does and what level of access it should have. For the Apache VHosts point of view it would be better to have...
[10:35:08] (CR) Volans: prometheus: add vhtcpd stats via node-exporter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/323559 (https://phabricator.wikimedia.org/T147429) (owner: Filippo Giunchedi)
[10:37:23] Operations, Analytics, netops: Thorium (new stat1001) needs to communicate with the Analytics VLAN - https://phabricator.wikimedia.org/T151990#2834430 (elukey)
[10:48:21] (PS3) Zfilipin: ChromeDriver should be in PATH for jobs that run Selenium tests [puppet] - https://gerrit.wikimedia.org/r/324203 (https://phabricator.wikimedia.org/T117418)
[10:48:46] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:50:24] Operations, Ops-Access-Requests, Discovery, Maps, and 2 others: Requesting access to analytics-privatedata-users for technical user discovery-stats - https://phabricator.wikimedia.org/T151063#2834456 (Gehel) Open>Resolved It looks like the script is now performing correctly, no error seen...
[10:51:22] (CR) Hashar: [C: 1] "Got cherry picked on the CI puppet master and that provisioned the symbolic links on the permanent slaves."
[puppet] - https://gerrit.wikimedia.org/r/324203 (https://phabricator.wikimedia.org/T117418) (owner: Zfilipin)
[10:52:07] (PS1) MarcoAurelio: Allow contentadmin and sysop to add/remove autopatrolled users on Wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/324401
[10:52:12] (PS1) Jcrespo: mariadb: More bugfixes for check_private_data.py [puppet] - https://gerrit.wikimedia.org/r/324402 (https://phabricator.wikimedia.org/T147052)
[10:52:27] (PS3) Giuseppe Lavagetto: mediawiki::hhvm: allow to override the default jemalloc arenas [puppet] - https://gerrit.wikimedia.org/r/324394 (https://phabricator.wikimedia.org/T151702)
[10:53:08] (CR) jenkins-bot: [V: -1] mariadb: More bugfixes for check_private_data.py [puppet] - https://gerrit.wikimedia.org/r/324402 (https://phabricator.wikimedia.org/T147052) (owner: Jcrespo)
[10:54:59] (PS2) Jcrespo: mariadb: More bugfixes for check_private_data.py [puppet] - https://gerrit.wikimedia.org/r/324402 (https://phabricator.wikimedia.org/T147052)
[10:58:25] (PS3) Jcrespo: mariadb: More bugfixes for check_private_data.py [puppet] - https://gerrit.wikimedia.org/r/324402 (https://phabricator.wikimedia.org/T147052)
[10:59:26] (CR) Giuseppe Lavagetto: [C: 2] mediawiki::hhvm: allow to override the default jemalloc arenas [puppet] - https://gerrit.wikimedia.org/r/324394 (https://phabricator.wikimedia.org/T151702) (owner: Giuseppe Lavagetto)
[10:59:37] (PS4) Jcrespo: mariadb: More bugfixes for check_private_data.py [puppet] - https://gerrit.wikimedia.org/r/324402 (https://phabricator.wikimedia.org/T147052)
[11:00:33] (CR) Jcrespo: [C: 2] mariadb: More bugfixes for check_private_data.py [puppet] - https://gerrit.wikimedia.org/r/324402 (https://phabricator.wikimedia.org/T147052) (owner: Jcrespo)
[11:07:23] <_joe_> !log rolling upgrade of hhvm on the eqiad api cluster
[11:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:39]
(PS1) Jcrespo: mariadb: Fix final check_private_data.py quoting bugs [puppet] - https://gerrit.wikimedia.org/r/324404 [11:08:10] (PS2) Jcrespo: mariadb: Fix final check_private_data.py quoting bugs [puppet] - https://gerrit.wikimedia.org/r/324404 [11:08:46] Operations, Traffic, Patch-For-Review: Huge increase in cache_upload 404s due to buggy client-side code from graphiq.com - https://phabricator.wikimedia.org/T151444#2834521 (ema) Leaving this ticket open though given that they still haven't fixed the javascript bug. The decrease in 404s is probably d... [11:08:52] jynus: can you ping me when you have to merge this? ^^^ same test of yesterday ;) [11:09:04] as soon as it +2 [11:09:14] (jenkins does) [11:09:39] so, now [11:09:52] (CR) Jcrespo: [C: 2] mariadb: Fix final check_private_data.py quoting bugs [puppet] - https://gerrit.wikimedia.org/r/324404 (owner: Jcrespo) [11:10:05] volans^ [11:10:05] ok, merge on gerrit, I'll take care of puppet-merge in a second [11:10:07] thanks [11:10:10] (CR) Faidon Liambotis: [C: 2] RAID: reduce MegaCLI sensibility (physical disks) [puppet] - https://gerrit.wikimedia.org/r/324240 (https://phabricator.wikimedia.org/T151043) (owner: Volans) [11:10:33] (PS1) ArielGlenn: miscdumps: fix up config defaults [dumps] - https://gerrit.wikimedia.org/r/324406 [11:10:35] (PS3) Volans: RAID: reduce MegaCLI sensibility (physical disks) [puppet] - https://gerrit.wikimedia.org/r/324240 (https://phabricator.wikimedia.org/T151043) [11:12:31] jynus: puppet-merged, thanks a lot! [11:12:58] paravoid: FYI all tests on puppet-merge molly guard successful, I'm merging that [11:13:27] can I ask the 1-line summary of that functionality? 
[11:13:57] sure, if there are commits from multiple committers when you run puppet-merge, instead of saying yes you have to say "multiple" [11:14:19] so if you type yes and enter without noticing the warning it aborts the merge [11:14:24] to prevent muscle memory errors [11:14:52] ofc it's saying that in the prompt, but I'll send an email too [11:14:55] what if I say yes (because there is 1 on check) but on merge there are multiple? [11:15:08] does it fix that? [11:15:56] not yet, also because the thing is async and run on multiple puppetmasters, so less trivial as a change [11:16:14] ok [11:16:19] yes, I know it is not simple [11:16:20] !log Stop replication s3 - db1095 - maintenance - T147052 [11:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:31] T147052: Provision with data the new labsdb servers and provide replica service with at least 1 shard from a sanitized copy from production - https://phabricator.wikimedia.org/T147052 [11:16:39] I looked at it and said myself "not worth the time" [11:16:50] (PS3) Volans: Puppet merge: molly-guard multiple commits [puppet] - https://gerrit.wikimedia.org/r/322362 [11:17:13] (CR) ArielGlenn: [C: 2] miscdumps: fix up config defaults [dumps] - https://gerrit.wikimedia.org/r/324406 (owner: ArielGlenn) [11:17:46] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [11:18:15] also if we cherry-pick instead of rebase/pull all sort of wrong things can go wrong [11:20:42] (PS1) Paladox: Phabricator: Allow us to change the default web domain [puppet] - https://gerrit.wikimedia.org/r/324408 [11:20:58] (PS2) Paladox: Phabricator: Allow us to change the default web domain [puppet] - https://gerrit.wikimedia.org/r/324408 [11:21:26] PROBLEM - puppet last run on db1024 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:21:49] (PS3) Paladox: Phabricator: Allow us to change the default web domain [puppet] - https://gerrit.wikimedia.org/r/324408 [11:21:51] (CR) jenkins-bot: [V: -1] Phabricator: Allow us to change the default web domain [puppet] - https://gerrit.wikimedia.org/r/324408 (owner: Paladox) [11:22:01] <_joe_> !log repooling mw1276, after tests for T151702 [11:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:11] T151702: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702 [11:22:12] (PS4) Paladox: Phabricator: Allow us to change the default web domain [puppet] - https://gerrit.wikimedia.org/r/324408 [11:23:09] (CR) jenkins-bot: [V: -1] Phabricator: Allow us to change the default web domain [puppet] - https://gerrit.wikimedia.org/r/324408 (owner: Paladox) [11:25:00] (PS1) ArielGlenn: update cron command and config file for adds/changes dumps [puppet] - https://gerrit.wikimedia.org/r/324409 [11:25:04] (PS5) Paladox: Phabricator: Allow us to change the default web domain [puppet] - https://gerrit.wikimedia.org/r/324408 [11:28:13] !log ariel@tin Starting deploy [dumps/dumps@50689c8]: (no message) [11:28:19] !log ariel@tin Finished deploy [dumps/dumps@50689c8]: (no message) (duration: 00m 07s) [11:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:32] (PS2) ArielGlenn: update cron command and config file for adds/changes dumps [puppet] - https://gerrit.wikimedia.org/r/324409 [11:34:22] Operations, DBA: Rolling restart of external storage servers for TLS certificate update - https://phabricator.wikimedia.org/T151995#2834580 (jcrespo) [11:35:52] (PS3) ArielGlenn: update cron command and config file for adds/changes dumps [puppet] - https://gerrit.wikimedia.org/r/324409 [11:36:03] 
(PS1) Jcrespo: mariadb: Drop ssl (tls) options from external storage servers [puppet] - https://gerrit.wikimedia.org/r/324411 (https://phabricator.wikimedia.org/T151995) [11:36:06] (PS1) Jcrespo: mariadb: Add semicolon after each SQL query output (private data check) [puppet] - https://gerrit.wikimedia.org/r/324412 (https://phabricator.wikimedia.org/T147052) [11:36:20] (PS2) Jcrespo: mariadb: Add semicolon after each SQL query output (private data check) [puppet] - https://gerrit.wikimedia.org/r/324412 (https://phabricator.wikimedia.org/T147052) [11:37:07] (CR) ArielGlenn: [C: 2] update cron command and config file for adds/changes dumps [puppet] - https://gerrit.wikimedia.org/r/324409 (owner: ArielGlenn) [11:37:13] jynus: I guess you and Ariel the unlucky candidate for my tests after merging :) [11:38:04] mmm [11:38:24] 324411 needs deep review [11:38:28] before merging [11:38:58] I was thiking 324412 [11:39:51] (CR) Marostegui: [C: 1] "looks good" [puppet] - https://gerrit.wikimedia.org/r/324411 (https://phabricator.wikimedia.org/T151995) (owner: Jcrespo) [11:41:02] (PS3) Jcrespo: mariadb: Add semicolon after each SQL query output (private data check) [puppet] - https://gerrit.wikimedia.org/r/324412 (https://phabricator.wikimedia.org/T147052) [11:42:36] (CR) Volans: [C: 1] "LGTM, the delicate part is the rolling restart and reset of SSL parameters in replication, in particular the cross-dc ones." 
[puppet] - https://gerrit.wikimedia.org/r/324411 (https://phabricator.wikimedia.org/T151995) (owner: Jcrespo) [11:44:14] nuria: hola/hi ping re https://gerrit.wikimedia.org/r/#/c/323699/ [11:44:53] <_joe_> !log upgrading HHVM across appservers in eqiad [11:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:10] (CR) Jcrespo: [C: 2] mariadb: Add semicolon after each SQL query output (private data check) [puppet] - https://gerrit.wikimedia.org/r/324412 (https://phabricator.wikimedia.org/T147052) (owner: Jcrespo) [11:47:23] volans^ [11:47:30] jynus: ok, running it [11:48:21] jynus: done, thanks again [11:49:26] RECOVERY - puppet last run on db1024 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [11:50:12] (PS7) Elukey: Refactor the parsing functions out of the main C file [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/322257 (https://phabricator.wikimedia.org/T147440) [12:00:51] (CR) Jcrespo: [C: 1] "Looks good: https://puppet-compiler.wmflabs.org/4720/" [puppet] - https://gerrit.wikimedia.org/r/324411 (https://phabricator.wikimedia.org/T151995) (owner: Jcrespo) [12:06:56] PROBLEM - puppet last run on db1037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:09:33] (PS2) Jcrespo: mariadb: Drop ssl (tls) options from external storage servers [puppet] - https://gerrit.wikimedia.org/r/324411 (https://phabricator.wikimedia.org/T151995) [12:17:40] (CR) Jcrespo: [C: 2 V: 2] mariadb: Drop ssl (tls) options from external storage servers [puppet] - https://gerrit.wikimedia.org/r/324411 (https://phabricator.wikimedia.org/T151995) (owner: Jcrespo) [12:18:46] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:26:35] Operations, DBA, Patch-For-Review: Rolling restart of external storage servers for TLS certificate update - https://phabricator.wikimedia.org/T151995#2834676 (jcrespo) p:Normal>High a:jcrespo [12:29:38] !log mysql restart and general upgrade for es2015 T151995 [12:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:52] T151995: Rolling restart of external storage servers for TLS certificate update - https://phabricator.wikimedia.org/T151995 [12:30:48] (CR) Elukey: [C: -1] "WIP" [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/322256 (https://phabricator.wikimedia.org/T147440) (owner: Elukey) [12:31:14] (CR) Elukey: [C: -1] "WIP" [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/322257 (https://phabricator.wikimedia.org/T147440) (owner: Elukey) [12:32:25] (CR) Alexandros Kosiaris: [C: -1] docker: add package provider (4 comments) [puppet] - https://gerrit.wikimedia.org/r/323815 (owner: Giuseppe Lavagetto) [12:35:56] RECOVERY - puppet last run on db1037 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [12:40:21] Operations, Traffic, Patch-For-Review: python-varnishapi daemons seeing "Log overrun" constantly - https://phabricator.wikimedia.org/T151643#2834702 (ema) I've confirmed with stap that the overruns come from vslc_vsm_next, which in turn calls vslc_vsm_check: https://github.com/varnishcache/varnish-ca... 
[12:40:51] (CR) Volans: "I've run a quick puppet compiler:" [puppet] - https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160) (owner: Dzahn) [12:44:21] Operations, Monitoring: dbstore1001 backup jobs failed between 2016-10-19 and 2016-11-23 - https://phabricator.wikimedia.org/T151579#2834707 (jcrespo) Open>Resolved a:jcrespo Last backups seem to have been successful: ``` ls -lha enwiki* -rw-r----- 1 root root 84G Nov 23 06:46 enwiki-20161... [12:46:07] Operations, DBA, Monitoring: Create script to monitor db dumps for backups are successful (and if not, old backups are not deleted) - https://phabricator.wikimedia.org/T151999#2834710 (jcrespo) [12:47:06] PROBLEM - puppet last run on ms-be1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:48:46] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [12:55:17] !log bumping vsl log buffer on cp3032 (depooled) -- T151643 [12:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:29] T151643: python-varnishapi daemons seeing "Log overrun" constantly - https://phabricator.wikimedia.org/T151643 [13:08:16] !log mysql restart and general upgrade for es2019 T151995 [13:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:28] T151995: Rolling restart of external storage servers for TLS certificate update - https://phabricator.wikimedia.org/T151995 [13:09:39] !log reedy@tin Synchronized php-1.29.0-wmf.3/extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php: More skipping (duration: 01m 34s) [13:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:42] !log reedy@tin Synchronized php-1.29.0-wmf.4/extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php: More skipping (duration: 00m 44s) [13:10:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:51] !log mysql restart and general upgrade for es2017 T151995 [13:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:04] T151995: Rolling restart of external storage servers for TLS certificate update - https://phabricator.wikimedia.org/T151995 [13:16:06] RECOVERY - puppet last run on ms-be1015 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [13:32:12] Operations, Cassandra, RESTBase, Services (doing): RESTBase k-r-v as Cassandra anti-pattern (or: revision retention policies considered harmful) - https://phabricator.wikimedia.org/T144431#2834839 (mark) [13:38:17] Operations, Wikimedia-Site-requests: Add IPv6 address for dashboard.wikiedu.org to the ratelimit exemptions - https://phabricator.wikimedia.org/T151823#2834851 (Dereckson) [13:38:59] Dereckson, isn't that in the mw config? [13:49:09] Krenair: yes, it is, I initially misread the request to add an IPv6 on a wmf server [13:55:18] !log Reset user email for projectcomwiki initial account "Mjohnson (WMF)" [13:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:36] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161130T1400). [14:00:04] Urbanecm, kart_, and mafk: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [14:00:24] * Urbanecm waves [14:00:26] Hello, I can SWAT. 
[14:00:31] (PS1) Dereckson: Add dashboard.wikiedu.org IPv6 to en.wikipedia rate limit exempt [mediawiki-config] - https://gerrit.wikimedia.org/r/324446 (https://phabricator.wikimedia.org/T151823) [14:00:37] I'll also add this change ^ [14:00:51] I'm here as kart__ [14:01:41] (CR) Urbanecm: [C: 1] "Fine for me." [mediawiki-config] - https://gerrit.wikimedia.org/r/324446 (https://phabricator.wikimedia.org/T151823) (owner: Dereckson) [14:02:01] Dereckson: have fun! I was just about to ask who wants to swat [14:03:15] zeljkof and Dereckson, you both can swat I think :D [14:03:37] o/ [14:03:56] Urbanecm: I'll leave it to Dereckson today :) [14:04:11] I have no problem with it :). [14:05:55] Operations, Discovery, Discovery-Search, Elasticsearch: Decrease time required to fully restart the Cirrus elasticsearch clusters - https://phabricator.wikimedia.org/T145065#2834951 (Gehel) declined>Open Re-opening this and linking it to upstream ticket: https://github.com/elastic/elastic... 
[14:06:45] !log mysql restart and general upgrade for es2014 T151995 [14:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:56] T151995: Rolling restart of external storage servers for TLS certificate update - https://phabricator.wikimedia.org/T151995 [14:07:22] (PS1) Giuseppe Lavagetto: mediawiki: tweak jemalloc arenas on api, appserver canaries [puppet] - https://gerrit.wikimedia.org/r/324447 (https://phabricator.wikimedia.org/T151702) [14:08:01] (CR) Giuseppe Lavagetto: [C: 2 V: 2] mediawiki: tweak jemalloc arenas on api, appserver canaries [puppet] - https://gerrit.wikimedia.org/r/324447 (https://phabricator.wikimedia.org/T151702) (owner: Giuseppe Lavagetto) [14:08:32] (PS2) Dereckson: [logo] Add logo for arbcom_cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/324188 (https://phabricator.wikimedia.org/T151731) (owner: Urbanecm) [14:09:24] (CR) Dereckson: [C: 2] [logo] Add logo for arbcom_cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/324188 (https://phabricator.wikimedia.org/T151731) (owner: Urbanecm) [14:10:44] (Merged) jenkins-bot: [logo] Add logo for arbcom_cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/324188 (https://phabricator.wikimedia.org/T151731) (owner: Urbanecm) [14:10:54] (PS3) Dereckson: [logo] Add logo to Wikivoyage Finnish [mediawiki-config] - https://gerrit.wikimedia.org/r/323697 (https://phabricator.wikimedia.org/T151571) (owner: Urbanecm) [14:11:37] (CR) Dereckson: [C: 2] [logo] Add logo to Wikivoyage Finnish [mediawiki-config] - https://gerrit.wikimedia.org/r/323697 (https://phabricator.wikimedia.org/T151571) (owner: Urbanecm) [14:12:21] Marco isn't here. 
[14:12:54] (Merged) jenkins-bot: [logo] Add logo to Wikivoyage Finnish [mediawiki-config] - https://gerrit.wikimedia.org/r/323697 (https://phabricator.wikimedia.org/T151571) (owner: Urbanecm) [14:13:21] How do we test wmgPrivilegedGroups, by asking someone from a group, not yet in another privileged group if 2FA works? [14:14:07] Dereckson, maybe a testaccount can become a member of a group and test it. But I have no access to rights at meta. [14:14:22] Urbanecm: your logos are live on mwdebug1002.eqiad.wmnet if you wish to check them at /static/... [14:14:31] Yes, going to check them. [14:15:37] Dereckson, you can deploy it to the whole network. [14:16:12] By the way, a scap pull on mwdebug1002 still hangs out after Finished rsync common [14:16:31] Does it mean something may be wrong? [14:16:59] Yes, but unrelated with the logos. [14:17:23] Okay, thanks. [14:17:35] !log dereckson@tin Synchronized static/images/project-logos: New project logos for wiki to create (arbcom cs, fi.wikivoyage) (duration: 00m 46s) [14:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:07] Dereckson, working, thanks. [14:18:34] (PS2) Dereckson: Add dashboard.wikiedu.org IPv6 to en.wikipedia rate limit exempt [mediawiki-config] - https://gerrit.wikimedia.org/r/324446 (https://phabricator.wikimedia.org/T151823) [14:18:48] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/324446 (https://phabricator.wikimedia.org/T151823) (owner: Dereckson) [14:19:28] (Merged) jenkins-bot: Add dashboard.wikiedu.org IPv6 to en.wikipedia rate limit exempt [mediawiki-config] - https://gerrit.wikimedia.org/r/324446 (https://phabricator.wikimedia.org/T151823) (owner: Dereckson) [14:21:34] kart__: so if I understand https://phabricator.wikimedia.org/T151868#2830467 your fix is only for newest code in wmf.4 and the issue doesn't exist in wmf.3? 
[14:22:12] 324446 live on mwdebug1002 [14:22:26] Dereckson: right [14:22:36] Dereckson: it will go live later today. [14:23:56] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Add dashboard.wikiedu.org IPv6 to en.wikipedia rate limit exempt (T151823) (duration: 00m 45s) [14:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:07] T151823: Add IPv6 address for dashboard.wikiedu.org to the ratelimit exemptions - https://phabricator.wikimedia.org/T151823 [14:27:37] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [14:29:07] Operations, Wikimedia-General-or-Unknown, Availability, Patch-For-Review, and 2 others: Job queue size growing since ~12:00 on 2016-11-19 - https://phabricator.wikimedia.org/T151196#2834996 (matmarex) So, I guess this is resolved? @joe @aaron Are there any follow-up tasks to be filed, or is T1238... [14:30:06] kart__: we're waiting mwext-testextension-php55 and mwext-testextension-hhvm jobs to run [14:32:29] Here we are [14:33:33] kart__: live on mwdebug1002 [14:33:37] Dereckson: done? [14:33:59] yes, and I sent your change on the mwdebug1002 server [14:34:07] This is the server replacing mw1099 [14:34:41] If you use the X Wikimedia Debug extension, it already has been upgraded [14:34:44] OK. Let me check if nothing breaks. [14:34:47] Yep [14:35:11] hi mafk [14:35:45] I were going to skip your change, as I don't have a lot of ideas about how to test it, Urbanecm offered to create a test account and add it to the group, to check 2FA works. [14:35:49] sorry I'm late [14:36:06] I really wouldn't worry about testing it [14:36:08] ok [14:36:14] Dereckson: Just look at Special:UserGroupRights [14:36:18] Check it's been added to the group [14:36:20] That's enough :) [14:36:28] oh yes true there is an associated user right for that [14:36:55] Dereckson: go ahead. as not possible to test change without fully deployed code. 
[14:37:11] (PS3) Dereckson: WMF staff local groups to $wmgPrivilegedGroups [mediawiki-config] - https://gerrit.wikimedia.org/r/324250 (https://phabricator.wikimedia.org/T150951) (owner: MarcoAurelio) [14:37:30] Dereckson: there's no need to add anyone, just look at special:listgrouprights if they have the oathauth-enable [14:37:41] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/324250 (https://phabricator.wikimedia.org/T150951) (owner: MarcoAurelio) [14:37:47] tell me when ready on mw1099 [14:37:57] or mwdebug1002 now [14:39:15] (Merged) jenkins-bot: WMF staff local groups to $wmgPrivilegedGroups [mediawiki-config] - https://gerrit.wikimedia.org/r/324250 (https://phabricator.wikimedia.org/T150951) (owner: MarcoAurelio) [14:39:19] kart__: syncing [14:39:57] mafk: live on mwdebug1002 [14:40:00] !log dereckson@tin Synchronized php-1.29.0-wmf.4/extensions/ContentTranslation/modules/tools/ext.cx.tools.template.js: Allow template editor even if parameter mapping fails completely (T151868) (duration: 00m 45s) [14:40:05] checking [14:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:14] T151868: Template translation is broken if none of the parameters can be auto-mapped - https://phabricator.wikimedia.org/T151868 [14:40:58] Dereckson: it's ok on debug [14:41:20] mafk: ok [14:41:22] Dereckson: thanks! [14:41:26] kart__: works? [14:41:43] Dereckson: we will only know later today :) [14:41:46] no worries. [14:41:47] ok [14:41:58] Fresh code+SWAT. 
[14:42:03] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Add WMF staff local groups to $wmgPrivilegedGroups (T150951) (duration: 00m 46s) [14:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:14] T150951: Create list of privileged wiki groups - https://phabricator.wikimedia.org/T150951 [14:42:40] !log mysql restart and general upgrade for es2011 T151995 [14:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:52] T151995: Rolling restart of external storage servers for TLS certificate update - https://phabricator.wikimedia.org/T151995 [14:44:05] !log EU SWAT done [14:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:14] \O/ [14:44:48] !log Stop MySQL and shutdown db2048 for maintenance - T149553 [14:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:56] T149553: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553 [14:52:26] PROBLEM - puppet last run on mw1218 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:57:33] (PS1) Giuseppe Lavagetto: mediawiki: make jemalloc arenas equal to the processorcount [puppet] - https://gerrit.wikimedia.org/r/324462 (https://phabricator.wikimedia.org/T151702) [14:58:14] not 4*cpucount? [15:00:45] <_joe_> paravoid: looking at Tim's tests, and my own, it won't make much of a difference and I preferred to be a bit conservative given we're waiting to see what the rationale from fb is [15:00:54] Operations, Analytics, netops: Thorium (new stat1001) needs to communicate with the Analytics VLAN - https://phabricator.wikimedia.org/T151990#2835067 (Ottomata) Uh OH! This is SUPPOSED to be in the analytics VLAN! https://phabricator.wikimedia.org/T149911 Re-opening that ticket. Sorry, I shoulda... 
[15:01:06] <_joe_> anyways, it won't make sense raising that much higher than 2*processorcount [15:01:15] <_joe_> or, the number of allowed hhvm threads [15:01:28] Operations, Analytics-Kanban, hardware-requests: stat1001 replacement box in eqiad - https://phabricator.wikimedia.org/T149911#2835068 (Ottomata) Resolved>Open Uh OH! @RobH > This should be installed within the Analytics VLAN, but it does not matter which row. I think thorium may have had... [15:01:59] Operations, Analytics-Kanban, hardware-requests: stat1001 replacement box in eqiad - https://phabricator.wikimedia.org/T149911#2835072 (Ottomata) [15:02:02] Operations, Analytics, netops: Thorium (new stat1001) needs to communicate with the Analytics VLAN - https://phabricator.wikimedia.org/T151990#2835074 (Ottomata) [15:11:20] !log mysql restart and general upgrade for es2012 T151995 [15:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:31] T151995: Rolling restart of external storage servers for TLS certificate update - https://phabricator.wikimedia.org/T151995 [15:11:58] (CR) Giuseppe Lavagetto: [C: 2] mediawiki: make jemalloc arenas equal to the processorcount [puppet] - https://gerrit.wikimedia.org/r/324462 (https://phabricator.wikimedia.org/T151702) (owner: Giuseppe Lavagetto) [15:14:13] <_joe_> !log upgrading HHVM on the imagescalers [15:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:36] PROBLEM - puppet last run on mw1294 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[strace] [15:19:26] RECOVERY - puppet last run on mw1218 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [15:20:36] RECOVERY - puppet last run on mw1294 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [15:25:25] Operations, Traffic: several 502 Bad Gateway - https://phabricator.wikimedia.org/T151686#2835098 (doctaxon) @ema * In a window of 2016-11-29 19:00 until 23:59 CET I couldn't receive any 502 Bad Gateway doing a lot of API queries using a permanent query loop. * I could receive the last 502 Bad Gatewa... [15:26:20] Operations, Traffic, Patch-For-Review, Prometheus-metrics-monitoring: Error collecting metrics from varnish_exporter on some misc hosts - https://phabricator.wikimedia.org/T150479#2787072 (ema) The errors mentioned in the ticket description seem to be gone from cp4001. [15:27:36] (PS1) Cmjohnson: Adding mgmt dns entries for restabse1016-1018 T150964 [dns] - https://gerrit.wikimedia.org/r/324465 [15:28:35] (CR) Cmjohnson: [C: 2] Adding mgmt dns entries for restabse1016-1018 T150964 [dns] - https://gerrit.wikimedia.org/r/324465 (owner: Cmjohnson) [15:29:10] !log mysql restart and general upgrade for es2013 T151995 [15:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:22] T151995: Rolling restart of external storage servers for TLS certificate update - https://phabricator.wikimedia.org/T151995 [15:29:55] Operations, Traffic: several 502 Bad Gateway - https://phabricator.wikimedia.org/T151686#2835114 (ema) Open>Resolved a:ema @doctaxon great, thanks for confirming that the problem is solved. The default value for an internal varnish setting was not large enough and that was causing crashes wit... 
[15:38:43] (PS1) Mforns: Add a reportupdater job for ee-migration [puppet] - https://gerrit.wikimedia.org/r/324466 (https://phabricator.wikimedia.org/T126358) [15:42:12] (CR) Addshore: [C: -1 V: -1] "Extension configuration has changed" [mediawiki-config] - https://gerrit.wikimedia.org/r/322086 (https://phabricator.wikimedia.org/T150945) (owner: Addshore) [15:43:52] (PS1) Eevans: enable instance restbase2012-c.codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/324469 (https://phabricator.wikimedia.org/T151086) [15:44:01] (PS3) Gehel: elasticsearch - upgrade to Java 8 [puppet] - https://gerrit.wikimedia.org/r/323154 (https://phabricator.wikimedia.org/T151325) [15:44:43] !log stopping for 24 hours cross-dc replication on shards es2,es3 codfw->eqiad (es1015, es1019) [15:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:02] (CR) Eevans: [C: 1] "Ready to pull the trigger." [puppet] - https://gerrit.wikimedia.org/r/324469 (https://phabricator.wikimedia.org/T151086) (owner: Eevans) [15:45:43] (CR) BryanDavis: [C: 1] Allow contentadmin and sysop to add/remove autopatrolled users on Wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/324401 (owner: MarcoAurelio) [15:46:17] Operations, DBA, Patch-For-Review: Rolling restart of external storage servers for TLS certificate update - https://phabricator.wikimedia.org/T151995#2835134 (jcrespo) !log stopping for 24 hours cross-dc replication on shards es2,es3 codfw->eqiad (es1015, es1019) [15:50:37] (CR) Gehel: "Puppet compiler agrees this is a noop on production systems" [puppet] - https://gerrit.wikimedia.org/r/323154 (https://phabricator.wikimedia.org/T151325) (owner: Gehel) [15:50:37] !log mysql restart and general upgrade for es2016 T151995 [15:50:41] (PS4) Gehel: elasticsearch - upgrade to Java 8 [puppet] - https://gerrit.wikimedia.org/r/323154 (https://phabricator.wikimedia.org/T151325) [15:50:48] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:48] T151995: Rolling restart of external storage servers for TLS certificate update - https://phabricator.wikimedia.org/T151995 [15:51:47] (CR) Gehel: [C: 2] elasticsearch - upgrade to Java 8 [puppet] - https://gerrit.wikimedia.org/r/323154 (https://phabricator.wikimedia.org/T151325) (owner: Gehel) [16:01:08] (PS2) Dzahn: enable instance restbase2012-c.codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/324469 (https://phabricator.wikimedia.org/T151086) (owner: Eevans) [16:05:36] (PS1) Chad: group1 to wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/324475 [16:06:00] (CR) Chad: [C: -2] "this iz 4 l8r" [mediawiki-config] - https://gerrit.wikimedia.org/r/324475 (owner: Chad) [16:07:06] Operations, Electron-PDFs, TCB-Team, Patch-For-Review, and 2 others: Deploy ElectronPdfService Extension to beta cluster - https://phabricator.wikimedia.org/T150945#2802120 (Addshore) a:Addshore [16:08:13] Operations, ops-eqiad, hardware-requests: Return wmf4747/wmf4748/wmf4749/wmf4750 to spares - https://phabricator.wikimedia.org/T146171#2835197 (RobH) a:Joe>RobH Nope, it should come back to me to go back on spares, stealing! [16:11:26] (CR) Dzahn: [C: 2] enable instance restbase2012-c.codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/324469 (https://phabricator.wikimedia.org/T151086) (owner: Eevans) [16:11:52] urandom: ^ [16:12:24] mutante: thanks! [16:12:56] !log mysql restart and general upgrade for es2018 T151995 [16:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:09] T151995: Rolling restart of external storage servers for TLS certificate update - https://phabricator.wikimedia.org/T151995 [16:13:26] mutante: and it looks to be off and running on it's own [16:14:00] urandom: you mean in a good way, it does it all by itself? 
[16:14:11] mutante: yeah [16:14:15] ok great :) [16:15:26] Operations, Wikimedia-General-or-Unknown, Availability, Patch-For-Review, and 2 others: Job queue size growing since ~12:00 on 2016-11-19 - https://phabricator.wikimedia.org/T151196#2835238 (Joe) @matmarex I can confirm the jobqueue is now under control and I think the only real thing missing is... [16:15:55] Operations, Parsing-Team, Release-Engineering-Team, HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2835239 (Joe) I have set arenas for jemalloc to be equal to the number of processors seen by the OS, the bandaid fix should be in the process of being remo... [16:17:46] PROBLEM - HHVM jobrunner on mw1167 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [16:18:46] RECOVERY - HHVM jobrunner on mw1167 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [16:18:48] <_joe_> !log rolling upgrade of HHVM on the jobrunner, terbium/tin/wasat/mira [16:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:35] (PS6) Andrew Bogott: bigbrother: Rewrite as python script [puppet] - https://gerrit.wikimedia.org/r/309216 (https://phabricator.wikimedia.org/T144955) (owner: BryanDavis) [16:19:56] PROBLEM - HHVM jobrunner on mw1303 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [16:20:56] RECOVERY - HHVM jobrunner on mw1303 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.001 second response time [16:21:00] Operations, Parsing-Team, Release-Engineering-Team, HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2835255 (Joe) So, with the HHVM part "solved" we still should take the prevention measures I named here: - Check the concurrency/retry/timeout rates of al... 
[16:22:06] 06Operations, 06Analytics-Kanban, 10hardware-requests: stat1001 replacement box in eqiad - https://phabricator.wikimedia.org/T149911#2835257 (10RobH) Correct, it was installed in the internal vlan, my bad! It'll need reinstallation, as well as the dns and network port being updated. [16:24:26] PROBLEM - HHVM jobrunner on mw1299 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.046 second response time [16:25:01] (03PS2) 10Addshore: DNM config for ElectronPdfService on beta sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322086 (https://phabricator.wikimedia.org/T150945) [16:25:26] RECOVERY - HHVM jobrunner on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.005 second response time [16:29:04] PROBLEM - HHVM jobrunner on mw1162 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [16:30:04] RECOVERY - HHVM jobrunner on mw1162 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [16:30:29] (03CR) 10Andrew Bogott: [C: 032] bigbrother: Rewrite as python script [puppet] - 10https://gerrit.wikimedia.org/r/309216 (https://phabricator.wikimedia.org/T144955) (owner: 10BryanDavis) [16:31:04] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [16:31:44] PROBLEM - cassandra-c CQL 10.192.48.70:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.70 and port 9042: Connection refused [16:32:04] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [16:34:07] (03PS6) 10Paladox: Phabricator: Allow us to change the default web domain [puppet] - 10https://gerrit.wikimedia.org/r/324408 [16:34:12] (03PS7) 10Paladox: Phabricator: Allow us to change the default web domain [puppet] - 10https://gerrit.wikimedia.org/r/324408 [16:34:19] 06Operations, 06Analytics-Kanban, 10hardware-requests: stat1001 replacement box 
in eqiad - https://phabricator.wikimedia.org/T149911#2835285 (10Ottomata) ​Ok, it can be reinstalled at will. The puppet that is in place is fine (it might fail on the first run). Let me know when it is back up and I will make s... [16:35:09] (03PS1) 10Jcrespo: mariadb: Depool es1012 for maintenance and upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324481 (https://phabricator.wikimedia.org/T151995) [16:36:04] PROBLEM - HHVM jobrunner on mw1301 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [16:36:26] (03PS1) 10ArielGlenn: pick up privatewikis fact from mediawiki config file [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/324482 [16:36:44] PROBLEM - puppet last run on mw1306 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[hhvm],Package[hhvm-dbg] [16:37:04] RECOVERY - HHVM jobrunner on mw1301 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.001 second response time [16:37:17] (03PS1) 10Cmjohnson: Revert "Adding mgmt dns entries for restabse1016-1018 T150964" [dns] - 10https://gerrit.wikimedia.org/r/324484 [16:38:30] gerrit.wm.org is slow for me [16:38:35] mutante apergos ^^ [16:38:58] google loads fine so it's not mine internet [16:38:58] ACKNOWLEDGEMENT - cassandra-c CQL 10.192.48.70:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.70 and port 9042: Connection refused eevans Bootstrapping [16:39:04] PROBLEM - HHVM jobrunner on mw1305 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [16:39:08] ostriches ^^ [16:39:15] <_joe_> paladox: same issue here [16:39:25] I think this is gc again [16:39:34] slow here also [16:39:49] <_joe_> paladox: I'll asbstain from guessing [16:40:04] RECOVERY - HHVM jobrunner on mw1305 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.001 second response time [16:40:20] Ok [16:41:11] 2016-11-30T16:40:33.633+0000: 
585655.771: [GC (Allocation Failure) [16:41:27] cpu looks to be very high on https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&c=Miscellaneous+eqiad&h=cobalt.wikimedia.org&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=NOGROUPS_%7C_network [16:41:43] (03Abandoned) 10Cmjohnson: Revert "Adding mgmt dns entries for restabse1016-1018 T150964" [dns] - 10https://gerrit.wikimedia.org/r/324484 (owner: 10Cmjohnson) [16:41:58] * cwd too: gerrit.wikimedia.org took too long to respond. [16:42:43] we are looking at it [16:42:56] root@cobalt:/var/lib/gerrit2/review_site/logs# tail -f error_log [16:43:03] at org.eclipse.jgit.transport.UploadPack.sendPack(UploadPack.java:1391) [16:43:16] On it [16:43:19] cool [16:43:36] gc logs look ok actually [16:43:41] what else we have going on? [16:43:44] PROBLEM - DPKG on mw1168 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:43:45] (03PS3) 10Addshore: Enable ElectronPdfService extension on beta sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322086 (https://phabricator.wikimedia.org/T150945) [16:43:47] (03PS1) 10Addshore: Enable ElectronPdfService extension on test wikis & mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324487 (https://phabricator.wikimedia.org/T150944) [16:43:48] confirming that, gc log looks like it was fast [16:43:52] (03PS1) 10Addshore: Enable ElectronPdfService extension on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324488 (https://phabricator.wikimedia.org/T150943) [16:43:54] (03PS1) 10Addshore: Enable ElectronPdfService extension on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324489 (https://phabricator.wikimedia.org/T150942) [16:44:06] <_joe_> it's failing its gc cycles, quite simply [16:44:18] <_joe_> apergos: no they don't [16:44:44] RECOVERY - DPKG on mw1168 is OK: All packages OK [16:44:57] it doesn't seem to be pausing for a long time for any of the cycles [16:45:07] <_joe_> -Xmx28g [16:45:10] and there aren't 
very many of these cycles in the last few minutes [16:45:26] this feels different from the former gc slowdowns [16:45:39] _joe_: what is it you are seeing? [16:45:40] <_joe_> uhm no that's actually ok [16:45:42] (03CR) 10jenkins-bot: [V: 04-1] Phabricator: Allow us to change the default web domain [puppet] - 10https://gerrit.wikimedia.org/r/324408 (owner: 10Paladox) [16:45:44] and there is the stuff about findObjectsToPackUsingBitmaps(PackWriter.java:1822 [16:45:47] jgit [16:45:54] That's more concerning [16:45:55] jgit again [16:46:09] is it missing objects again? [16:46:29] Hmmm [16:46:33] well, it did something to "find objects" and then it crashed ? [16:46:40] Oh [16:47:05] The missing objects triggered by upload-pack are spammy, but harmless, effectively. [16:47:05] I guess it may be finding that object that we had a problem with at the weekend? [16:47:14] But trying to find object to pack using bitmaps seems bad. [16:47:25] (03PS8) 10Paladox: Phabricator: Allow us to change the default web domain [puppet] - 10https://gerrit.wikimedia.org/r/324408 [16:47:55] there a new error popped up [16:48:09] Internal error during upload-pack [16:48:21] Missing commit [16:48:42] which commit is missing? [16:49:10] 89af503db5298364bd77b2ecf997f0a88edc67e2 [16:49:32] (03CR) 10Jcrespo: "Yes, this with https://gerrit.wikimedia.org/r/324153 + a few changes on the template will work." (031 comment) [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/324482 (owner: 10ArielGlenn) [16:49:34] upload-pack is spammy but harmless [16:49:37] Ignore it, red herring [16:49:44] sudo lsof -u gerrit2 | wc -l [16:49:44] 4565 [16:49:48] ^ That's more interesting. [16:49:58] Tons of low-traffic repos open by gerrit. [16:50:15] something that is a similar http://stackoverflow.com/questions/34654723/gerrit-to-gerrit-replication-issue [16:50:25] just i carn't see if they had performance issues when it happened [16:50:42] That's not even remotely related. 
[16:50:43] (03CR) 10Giuseppe Lavagetto: docker: add package provider (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/323815 (owner: 10Giuseppe Lavagetto) [16:50:47] <_joe_> akosiaris: ^^ [16:51:10] ostriches, is it safe to use gerrit to deploy to scap right now or should I wait? [16:51:29] btw, it's working again for me [16:51:35] and the load is totally down now [16:51:38] ostriches would that be "•Prevent double closing of repository when merging changes." [16:51:44] No. [16:51:49] ok [16:52:10] jynus: Gimmie just another minute or two [16:52:42] sure [16:55:19] jynus: Go ahead. [16:56:05] (03CR) 10Chad: [C: 031] "Actually let's land this today." [puppet] - 10https://gerrit.wikimedia.org/r/323655 (https://phabricator.wikimedia.org/T151676) (owner: 10Reedy) [16:56:44] 06Operations, 10Analytics, 10Analytics-Cluster, 10Traffic: Enable Kafka native TLS in 0.9 and secure the kafka traffic with it - https://phabricator.wikimedia.org/T121561#2835376 (10Ottomata) [16:56:51] +1 on the disable [16:57:48] The load spikes caused by auto-gcs are far worse than the savings we get from running gc's [16:57:57] stupid jgit. i hates u [16:58:01] :-d [16:58:22] haters gonna hate [16:58:38] thats just a bug and should be fixed in the next update. [16:58:50] unless another bug peaks in [16:58:58] (03PS2) 10Dzahn: Disable git gc as source of breakages [puppet] - 10https://gerrit.wikimedia.org/r/323655 (https://phabricator.wikimedia.org/T151676) (owner: 10Reedy) [16:59:09] +1 doing [16:59:12] (03CR) 10Jcrespo: pick up privatewikis fact from mediawiki config file (031 comment) [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/324482 (owner: 10ArielGlenn) [16:59:20] I probably won't ever re-enable it. I don't trust jgit gc. [16:59:23] what all of you said [16:59:25] Fix a bug: 3 more appear. 
[16:59:46] yep [16:59:50] (03CR) 10Jcrespo: [C: 032] mariadb: Depool es1012 for maintenance and upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324481 (https://phabricator.wikimedia.org/T151995) (owner: 10Jcrespo) [17:00:59] (03PS3) 10Dzahn: gerrit: Disable git gc as source of breakages [puppet] - 10https://gerrit.wikimedia.org/r/323655 (https://phabricator.wikimedia.org/T151676) (owner: 10Reedy) [17:01:06] (03CR) 10Dzahn: [C: 032 V: 032] gerrit: Disable git gc as source of breakages [puppet] - 10https://gerrit.wikimedia.org/r/323655 (https://phabricator.wikimedia.org/T151676) (owner: 10Reedy) [17:01:46] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool es1012 (duration: 00m 53s) [17:01:54] woo hoo [17:01:57] kill it dead [17:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:25] !log gerrit restarting to disable gc, config change 323655) [17:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:20] done [17:03:44] RECOVERY - puppet last run on mw1306 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [17:03:52] (03PS2) 10ArielGlenn: pick up privatewikis fact from mediawiki config file [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/324482 [17:04:33] 06Operations, 10ops-eqiad, 13Patch-For-Review: eqiad: Rack and setup new restbase nodes - https://phabricator.wikimedia.org/T150964#2835404 (10Cmjohnson) All 3 restbase servers are racked and have mgmt access. I did not do production dns. Switch config is completed as well. I set them up for 3 production c... [17:04:45] (03CR) 10ArielGlenn: pick up privatewikis fact from mediawiki config file (032 comments) [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/324482 (owner: 10ArielGlenn) [17:05:14] PROBLEM - puppet last run on kafka2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [17:07:21] 06Operations, 10Wikimedia-Extension-setup, 07I18n: Deploy IDS rendering engine to production - https://phabricator.wikimedia.org/T148693#2835414 (10Arthur2e5) > but even identifiers are in Chinese in some places There were actually plans to use Chinese class & filenames in the java servlet (halted due to co... [17:10:52] (03CR) 10Jcrespo: "We should now confine this to the mariadb::sanitarium and mariadb::sanitarium2 roles." [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/324482 (owner: 10ArielGlenn) [17:12:06] (03CR) 10Dzahn: [C: 04-1] "i think this should be done in the role instead where $domain and $altdom are being set" [puppet] - 10https://gerrit.wikimedia.org/r/324408 (owner: 10Paladox) [17:13:14] (03CR) 10Dzahn: "i mean, i see this is also "in the role" but further up we just set $domain which is then being used in the "base uri" string." [puppet] - 10https://gerrit.wikimedia.org/r/324408 (owner: 10Paladox) [17:16:40] 06Operations, 06Parsing-Team, 06Release-Engineering-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2835456 (10greg) >>! In T151702#2835255, @Joe wrote: > So, with the HHVM part "solved" we still should take the prevention measures I named here: > > - Chec... 
[17:20:21] (03PS9) 10Paladox: Phabricator: Allow us to change the default web domain [puppet] - 10https://gerrit.wikimedia.org/r/324408 [17:20:25] (03PS10) 10Paladox: Phabricator: Allow us to change the default web domain [puppet] - 10https://gerrit.wikimedia.org/r/324408 [17:20:53] !log mysql restart and general upgrade for es1012 T151995 [17:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:04] T151995: Rolling restart of external storage servers for TLS certificate update - https://phabricator.wikimedia.org/T151995 [17:21:54] PROBLEM - Debian mirror in sync with upstream on sodium is CRITICAL: /srv/mirrors/debian is over 14 hours old. [17:22:04] PROBLEM - puppet last run on ms-be1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:29:52] 06Operations, 06Parsing-Team, 06Release-Engineering-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2835543 (10Joe) @greg yeah I know, I'll do my homework, promised :) I'm just waiting to see if the issue happens again in the next couple of days before clo... 
[17:33:14] RECOVERY - puppet last run on kafka2003 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [17:37:14] (03PS1) 10Chad: ExtDist: REL1_28 default, REL1_29 added (commented), REL1_26 removed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324505 [17:39:19] (03CR) 10Chad: [C: 032] ExtDist: REL1_28 default, REL1_29 added (commented), REL1_26 removed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324505 (owner: 10Chad) [17:39:54] (03Merged) 10jenkins-bot: ExtDist: REL1_28 default, REL1_29 added (commented), REL1_26 removed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324505 (owner: 10Chad) [17:43:24] (03CR) 10Addshore: "Repo requested at https://www.mediawiki.org/w/index.php?title=Git/New_repositories/Requests/Entries&diff=2298621&oldid=2298115" [puppet] - 10https://gerrit.wikimedia.org/r/322220 (https://phabricator.wikimedia.org/T147328) (owner: 10Addshore) [17:44:31] (03CR) 10ArielGlenn: "Hm, so apparently facts are available on all nodes, whether or not the module itself is applied there. 
Maybe we would be better off with " [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/324482 (owner: 10ArielGlenn) [17:45:05] !log demon@tin Synchronized wmf-config/CommonSettings.php: extdist stuffs (duration: 00m 46s) [17:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:04] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [17:47:04] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4757321 keys, up 30 days 9 hours - replication_delay is 0 [17:51:04] RECOVERY - puppet last run on ms-be1007 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [18:05:36] (03PS1) 10Jcrespo: Revert "mariadb: Depool es1012 for maintenance and upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324512 [18:11:51] 06Operations, 10Traffic, 13Patch-For-Review, 05Prometheus-metrics-monitoring: Error collecting metrics from varnish_exporter on some misc hosts - https://phabricator.wikimedia.org/T150479#2835710 (10fgiunchedi) @ema indeed the new upstream version should fix the errors. What's left to figure out is what to... [18:13:16] (03CR) 10Jcrespo: "I am not sure I want this running on all all boxes, and getting random facts whenever that file randomly exist for other reasons. 
Maybe so" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/324482 (owner: 10ArielGlenn) [18:15:26] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool es1012 for maintenance and upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324512 (owner: 10Jcrespo) [18:17:24] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=46%) [18:19:06] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool es1012 (duration: 00m 46s) [18:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:28] ACKNOWLEDGEMENT - Check for gridmaster host resolution TCP on labtest-ns0.wikimedia.org is CRITICAL: DNS CRITICAL - 0.160 seconds response time (No ANSWER SECTION found) daniel_zahn duration 200 days [18:22:29] ACKNOWLEDGEMENT - Check for gridmaster host resolution UDP on labtest-ns0.wikimedia.org is CRITICAL: DNS CRITICAL - 0.114 seconds response time (No ANSWER SECTION found) daniel_zahn duration 200 days [18:22:53] 06Operations, 10ops-eqiad, 13Patch-For-Review: eqiad: Rack and setup new restbase nodes - https://phabricator.wikimedia.org/T150964#2835751 (10fgiunchedi) >>! In T150964#2835404, @Cmjohnson wrote: > All 3 restbase servers are racked and have mgmt access. I did not do production dns. Switch config is comple... [18:24:41] (03PS1) 10Jcrespo: mariadb: Depool es1013 for maintenance and general upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324518 (https://phabricator.wikimedia.org/T151995) [18:24:44] PROBLEM - puppet last run on ms-fe3001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:25:24] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK [18:25:51] !log labnet2001 - ran low on disk, gzipped large /var/log/upstart/nova-api.log.1 / apt-get clean [18:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:04] labtestnet2001..fixing in wiki [18:27:20] !log last log message was about "labtestnet2001" not "labnet2001" [18:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:25] (03CR) 10Jcrespo: [C: 032] mariadb: Depool es1013 for maintenance and general upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324518 (https://phabricator.wikimedia.org/T151995) (owner: 10Jcrespo) [18:29:54] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:29:59] 06Operations, 10Phabricator: iridium / filesystem almost full - https://phabricator.wikimedia.org/T150396#2835766 (10mmodell) hmm... I'm not sure what's up with that. Can we just mount it as tmpfs? 
[18:30:44] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING [18:31:33] (03PS2) 10Jcrespo: mariadb: Depool es1013 for maintenance and general upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324518 (https://phabricator.wikimedia.org/T151995) [18:36:44] (03CR) 10Jcrespo: [C: 032 V: 032] mariadb: Depool es1013 for maintenance and general upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324518 (https://phabricator.wikimedia.org/T151995) (owner: 10Jcrespo) [18:39:24] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool es1013 (duration: 00m 45s) [18:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:25] 06Operations, 10Phabricator: iridium / filesystem almost full - https://phabricator.wikimedia.org/T150396#2835781 (10Dzahn) Does Phabricator have a config option where to store the temp files? [18:43:07] !log mysql restart and general upgrade for es1013 T151995 [18:43:18] (03PS1) 10RobH: thorium should be in analytics vlan [dns] - 10https://gerrit.wikimedia.org/r/324527 [18:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:20] T151995: Rolling restart of external storage servers for TLS certificate update - https://phabricator.wikimedia.org/T151995 [18:44:05] (03CR) 10RobH: [C: 032] thorium should be in analytics vlan [dns] - 10https://gerrit.wikimedia.org/r/324527 (owner: 10RobH) [18:48:06] 06Operations, 10Phabricator: iridium / filesystem almost full - https://phabricator.wikimedia.org/T150396#2835806 (10Paladox) I think it is https://secure.phabricator.com/book/phabricator/article/configuring_file_storage/ "Engine: Local Disk" [18:51:13] 06Operations, 10Icinga, 06Labs, 10Labs-Infrastructure: remove/fix "Check for gridmaster host resolution" Icinga check for "labtest" - https://phabricator.wikimedia.org/T152024#2835831 (10Dzahn) p:05Triage>03Low [18:52:04] ACKNOWLEDGEMENT - 
Check for gridmaster host resolution TCP on labtest-ns0.wikimedia.org is CRITICAL: DNS CRITICAL - 0.160 seconds response time (No ANSWER SECTION found) daniel_zahn https://phabricator.wikimedia.org/T152024 [18:52:04] ACKNOWLEDGEMENT - Check for gridmaster host resolution UDP on labtest-ns0.wikimedia.org is CRITICAL: DNS CRITICAL - 0.114 seconds response time (No ANSWER SECTION found) daniel_zahn https://phabricator.wikimedia.org/T152024 [18:52:44] RECOVERY - puppet last run on ms-fe3001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [18:52:46] 06Operations, 10Icinga, 06Labs, 10Labs-Infrastructure: remove/fix "Check for gridmaster host resolution" Icinga check for "labtest" - https://phabricator.wikimedia.org/T152024#2835820 (10Dzahn) [18:55:06] (03PS1) 10Jcrespo: Revert "mariadb: Depool es1013 for maintenance and general upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324530 [18:55:45] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:59:25] (03PS1) 10Filippo Giunchedi: eventlogging: keepLastValue for eventlogging_NavigationTiming [puppet] - 10https://gerrit.wikimedia.org/r/324532 [18:59:31] 06Operations, 10Phabricator: iridium / filesystem almost full - https://phabricator.wikimedia.org/T150396#2835865 (10Dzahn) Thanks, this sounds like it ``` storage.local-disk.path: Set to some writable directory on local disk. Make that directory. ``` We could simply use /srv/tmp because there we have h... [19:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161130T1900). Please do the needful. 
[19:01:02] (03CR) 10Filippo Giunchedi: [C: 032] templates: add PTR for pdfrender [dns] - 10https://gerrit.wikimedia.org/r/324372 (owner: 10Filippo Giunchedi) [19:01:07] (03PS2) 10Filippo Giunchedi: templates: add PTR for pdfrender [dns] - 10https://gerrit.wikimedia.org/r/324372 [19:01:11] (03CR) 10Filippo Giunchedi: [V: 032] templates: add PTR for pdfrender [dns] - 10https://gerrit.wikimedia.org/r/324372 (owner: 10Filippo Giunchedi) [19:01:20] (03PS2) 10Ottomata: Allow misc directors to specify url path conditions as well as Host conditions [puppet] - 10https://gerrit.wikimedia.org/r/322964 [19:01:42] (03PS3) 10Ottomata: Allow misc directors to specify url path conditions as well as Host conditions [puppet] - 10https://gerrit.wikimedia.org/r/322964 [19:02:02] (03CR) 10Ottomata: [C: 031] eventlogging: keepLastValue for eventlogging_NavigationTiming [puppet] - 10https://gerrit.wikimedia.org/r/324532 (owner: 10Filippo Giunchedi) [19:02:38] (03PS2) 10Filippo Giunchedi: templates: add logstash.svc [dns] - 10https://gerrit.wikimedia.org/r/324373 (https://phabricator.wikimedia.org/T151971) [19:02:45] 06Operations, 10Phabricator: iridium / filesystem almost full - https://phabricator.wikimedia.org/T150396#2835893 (10Dzahn) Of course there should be some cleanup but we also don't have to make it hard on us by limiting ourselves to this tiny 10G / where you are constantly fighting to keep the remaining 2G fr... [19:04:22] !log demon@tin Synchronized php-1.29.0-wmf.4/includes/specials/SpecialUserrights.php: Ia0e583a5 (duration: 00m 45s) [19:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:31] (03PS2) 10RobH: thorium should be in analytics vlan [dns] - 10https://gerrit.wikimedia.org/r/324527 [19:05:33] Can I add stuff to SWAT now? 
[19:06:04] (03CR) 10Filippo Giunchedi: [C: 032] templates: add logstash.svc [dns] - 10https://gerrit.wikimedia.org/r/324373 (https://phabricator.wikimedia.org/T151971) (owner: 10Filippo Giunchedi) [19:06:44] (03Abandoned) 10RobH: thorium should be in analytics vlan [dns] - 10https://gerrit.wikimedia.org/r/324527 (owner: 10RobH) [19:06:54] (03PS2) 10Filippo Giunchedi: eventlogging: keepLastValue for eventlogging_NavigationTiming [puppet] - 10https://gerrit.wikimedia.org/r/324532 [19:09:50] (03PS1) 10RobH: thorium vlan change [dns] - 10https://gerrit.wikimedia.org/r/324537 [19:10:21] (03CR) 10RobH: [C: 032] thorium vlan change [dns] - 10https://gerrit.wikimedia.org/r/324537 (owner: 10RobH) [19:10:24] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /home 40583 MB (3% inode=98%) [19:11:13] sigh, ci/jenkins backed up again, a lot of jobs in the queue [19:11:59] godog: nodepool [19:12:08] "leaking instances" [19:12:11] and stuff [19:12:56] mhh that but also an influx of jobs backs things up, I think it is from core [19:13:09] 08:09 hashar migrated mediawiki-extensions-* tests to nodepool today [19:13:12] 08:10 but nodepool did get a bump in resources [19:13:20] the extensions [19:13:23] afaict [19:14:38] it could be a leak or all the nodepool instances have reached it's creation limit [19:14:56] (03CR) 10Filippo Giunchedi: [C: 032] eventlogging: keepLastValue for eventlogging_NavigationTiming [puppet] - 10https://gerrit.wikimedia.org/r/324532 (owner: 10Filippo Giunchedi) [19:15:07] mutante godog ^^ [19:16:05] eventually it flushes the queue of jobs heh so it makes progress, just that when hit with many jobs it takes a while to process all [19:16:25] At peak times normaly around now it will get slow [19:16:27] also afaict no priority among jobs so there can be starvation [19:16:59] mutante mediawiki-extensions havent actually been added to the test pipeline just they have been created [19:18:04] paladox: ah! 
ok [19:18:27] yep [19:19:18] (03PS1) 10Jcrespo: mariadb: depool db1017 for maintenance and general upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324539 (https://phabricator.wikimedia.org/T151995) [19:19:19] 06Operations, 10Icinga, 06Labs, 10Labs-Infrastructure: remove/fix "Check for gridmaster host resolution" Icinga check for "labtest" - https://phabricator.wikimedia.org/T152024#2835945 (10Krenair) "gridmaster host resolution" is a tools project specific thing, why is it even in icinga instead of shinken? [19:19:28] mutante apparently you can disable [19:19:31] the storage engine [19:19:33] https://github.com/wikimedia/phabricator/blob/bf75469a3427f7b9bab9628f6c6a62ec8f7e7f1f/src/applications/files/config/PhabricatorFilesConfigOptions.php#L176 [19:19:39] on phabricator [19:20:09] It seems to be null already [19:20:16] So we must have it set some where in puppet [19:21:10] (03PS1) 10Ladsgroup: Add 'softest' values for ores [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324540 (https://phabricator.wikimedia.org/T150224) [19:21:24] But i carn't find it in puppet [19:22:49] Anyone around to do swat? thcipriani ? [19:24:54] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [19:25:31] yup, I'm around. Didn't see anything earlier... [19:25:41] twentyafterfour why is mysql limit small on https://phabricator.wikimedia.org/applications/view/PhabricatorFilesApplication/ [19:25:42] ? [19:26:31] 06Operations, 10Phabricator: iridium / filesystem almost full - https://phabricator.wikimedia.org/T150396#2835964 (10fgiunchedi) @mmodell tmpfs won't help because it doesn't clean up old files by itself. @dzahn @Paladox yeah that might help if it actually gets honoured. My thought on that was that if all phab... 
[19:27:04] thcipriani: https://gerrit.wikimedia.org/r/324540 [19:27:16] and https://gerrit.wikimedia.org/r/324541 [19:27:27] before wmf.4 gets wikidata [19:27:38] it would be great [19:27:59] Amir1: sure I can get those out. Could you add them to the deployments page? [19:28:09] Doing it right now [19:28:13] 06Operations, 10Phabricator: iridium / filesystem almost full - https://phabricator.wikimedia.org/T150396#2835971 (10Paladox) @fgiunchedi hi, looking at it more it says we should be using MySQL by default so we may have a bug in that since it should not have set a local disk path as I see no setting has been s... [19:28:22] 06Operations, 10DBA: Rolling restart of parsercache servers for TLS certificate update - https://phabricator.wikimedia.org/T152029#2835972 (10jcrespo) [19:28:47] cool thanks :) [19:29:09] 06Operations, 10netops: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#2835986 (10faidon) JTAC thinks this may be [[ https://prsearch.juniper.net/InfoCenter/index?page=prcontent&id=PR957108 | PR957108 ]]: ``` Title: IPv6 neighbor... [19:29:32] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324540 (https://phabricator.wikimedia.org/T150224) (owner: 10Ladsgroup) [19:29:43] 06Operations, 10DBA: Rolling restart of parsercache servers for TLS certificate update - https://phabricator.wikimedia.org/T152029#2835972 (10jcrespo) p:05Normal>03High [19:29:58] 06Operations, 10Analytics: Install java 8 to stat1002 - https://phabricator.wikimedia.org/T151896#2835991 (10Ottomata) Hm hm. openjdk-7-jdk is required in a few classes that get included on stat1002. As long as the default alternative isn't updated as consequence of doing a `require_package('openjdk-8-jdk'0)... 
[19:30:05] (03Merged) 10jenkins-bot: Add 'softest' values for ores [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324540 (https://phabricator.wikimedia.org/T150224) (owner: 10Ladsgroup) [19:30:28] (03PS2) 10Ottomata: Add a reportupdater job for ee-migration [puppet] - 10https://gerrit.wikimedia.org/r/324466 (https://phabricator.wikimedia.org/T126358) (owner: 10Mforns) [19:31:10] Amir1: https://gerrit.wikimedia.org/r/#/c/324540/1 is live on mwdebug1002 if there is anything to check there [19:31:26] (03PS1) 10Jcrespo: mariadb: Upgrade parsercache servers to use the puppet TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/324542 (https://phabricator.wikimedia.org/T152029) [19:31:28] yeah, let me check [19:32:58] thcipriani: the config one works like a charm [19:33:05] Amir1: ok, going live [19:33:48] (03CR) 10Jcrespo: [C: 032] mariadb: Upgrade parsercache servers to use the puppet TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/324542 (https://phabricator.wikimedia.org/T152029) (owner: 10Jcrespo) [19:34:30] (03CR) 10Ottomata: [C: 032] Add a reportupdater job for ee-migration [puppet] - 10https://gerrit.wikimedia.org/r/324466 (https://phabricator.wikimedia.org/T126358) (owner: 10Mforns) [19:34:41] (03PS3) 10Ottomata: Add a reportupdater job for ee-migration [puppet] - 10https://gerrit.wikimedia.org/r/324466 (https://phabricator.wikimedia.org/T126358) (owner: 10Mforns) [19:34:44] (03CR) 10Ottomata: [V: 032] Add a reportupdater job for ee-migration [puppet] - 10https://gerrit.wikimedia.org/r/324466 (https://phabricator.wikimedia.org/T126358) (owner: 10Mforns) [19:34:51] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:324540|Add "softest" values for ores]] T150224 (duration: 00m 46s) [19:34:53] Someone called this https://phabricator.wikimedia.org/applications/view/PhabricatorFilesApplication/ [19:34:54] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 17 failures. 
Last run 2 minutes ago with 17 failures. Failed resources (up to 3 shown): Service[ferm],Service[diamond],Service[prometheus-node-exporter],Package[ecryptfs-utils] [19:34:58] blob store for pokemon [19:35:01] ^ Amir1 live everywhere now [19:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:03] T150224: Add "Lowest" ORES sensitivity for fpr=0.1 - https://phabricator.wikimedia.org/T150224 [19:35:07] mutante twentyafterfour ^^ lol [19:35:16] thcipriani: amazing [19:35:17] thanks [19:36:06] yw :) [19:36:28] Amir1: css change is for wmf.4 is live on mwdebug1002 [19:36:37] s/is// [19:36:54] since no wiki in group0 has ores enabled it's not testable [19:37:33] ok, will push live [19:38:09] In beta it's great: https://en.wikipedia.beta.wmflabs.org/w/index.php?title=Special:RecentChanges&hidenondamaging=1 [19:38:35] One thing. Anyone admin on enwiki to make a mediawiki page for now? [19:38:53] (You can also use the staff account) [19:39:36] paladox: hehe, omg.. Pokémon is a trademark but only with the accent on the e :p [19:39:48] Yep lol [19:39:56] !log thcipriani@tin Synchronized php-1.29.0-wmf.4/extensions/ORES/modules/ext.ores.styles.css: SWAT: [[gerrit:324541|Use darker shade of yellow]] (duration: 00m 45s) [19:39:57] Operations, Traffic, Patch-For-Review: python-varnishapi daemons seeing "Log overrun" constantly - https://phabricator.wikimedia.org/T151643#2823504 (Volans) Quick summary from my tests of today: - most of the time is spent calling `_callBack()` in varnishapi.py:879` - regarding the 2 `while 1:`... [19:40:03] ^ Amir1 css change should be live [19:40:03] it is used for storing pokemon according to desc lol [19:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:11] odd I just lost my gerrit session [19:40:13] (CR) ArielGlenn: "I see your point, will think about this some more."
[puppet/mariadb] - https://gerrit.wikimedia.org/r/324482 (owner: ArielGlenn) [19:40:13] thcipriani: thanks :) [19:40:25] (PS1) Jcrespo: mariadb: update my.cnf for parsercache tls implementation [puppet] - https://gerrit.wikimedia.org/r/324543 (https://phabricator.wikimedia.org/T152029) [19:41:07] godog: there was a service restart but a little over 2 hours ago [19:41:58] (PS2) Jcrespo: mariadb: update my.cnf for parsercache tls implementation [puppet] - https://gerrit.wikimedia.org/r/324543 (https://phabricator.wikimedia.org/T152029) [19:43:09] can I quickly deploy some db pool/depools to mediawiki? [19:43:30] (CR) Jcrespo: [C: 2] mariadb: update my.cnf for parsercache tls implementation [puppet] - https://gerrit.wikimedia.org/r/324543 (https://phabricator.wikimedia.org/T152029) (owner: Jcrespo) [19:43:47] (CR) Filippo Giunchedi: prometheus: add vhtcpd stats via node-exporter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/323559 (https://phabricator.wikimedia.org/T147429) (owner: Filippo Giunchedi) [19:44:02] mutante: ah thanks, I've already logged in after that, bah [19:45:57] volans: I'm going to merge the vhtcp patch for now and defer to the task for the filename convention [19:46:14] I am going to assume nobody is deploying and going ahead [19:46:34] (PS2) Jcrespo: Revert "mariadb: Depool es1013 for maintenance and general upgrade" [mediawiki-config] - https://gerrit.wikimedia.org/r/324530 [19:46:46] Operations, ops-eqiad: return/replace bad JNP-QSFP- DAC-5M - https://phabricator.wikimedia.org/T152032#2836061 (RobH) [19:47:04] Operations, ops-eqiad: return/replace bad JNP-QSFP- DAC-5M - https://phabricator.wikimedia.org/T152032#2836078 (RobH) >>!
In T149726#2763763, @Cmjohnson wrote: > The cable information > > 740-0328625 REV 01 5.0M > MOLEX QFSP+ 1110409057 REV 8 > MOC15506250085 MADE IN CHINA [19:47:25] (CR) Jcrespo: [C: 2] Revert "mariadb: Depool es1013 for maintenance and general upgrade" [mediawiki-config] - https://gerrit.wikimedia.org/r/324530 (owner: Jcrespo) [19:47:38] (PS2) Jcrespo: mariadb: depool db1017 for maintenance and general upgrade [mediawiki-config] - https://gerrit.wikimedia.org/r/324539 (https://phabricator.wikimedia.org/T151995) [19:48:22] godog: ok, but let's find an agreement so that we have 1 way to do that and at least for new thing we stick on it [19:49:07] (CR) Jcrespo: [C: 2] mariadb: depool db1017 for maintenance and general upgrade [mediawiki-config] - https://gerrit.wikimedia.org/r/324539 (https://phabricator.wikimedia.org/T151995) (owner: Jcrespo) [19:50:17] yeah, the underscore/dash will be odd since it is just different conventions, anyways will comment on the task [19:50:50] that'd be T144169 [19:50:50] T144169: Flake8 for python files without extension in puppet repo - https://phabricator.wikimedia.org/T144169 [19:50:54] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool es1013; depool es1017 (duration: 00m 45s) [19:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:42] godog: yeah, thanks [19:53:14] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [19:54:14] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4756916 keys, up 30 days 11 hours - replication_delay is 59 [19:55:01] Operations, ops-eqiad: return/replace bad JNP-QSFP- DAC-5M - https://phabricator.wikimedia.org/T152032#2836101 (RobH) Case ID 2016-1130-0744 has been created for you.
[19:56:31] (PS4) Filippo Giunchedi: prometheus: add vhtcpd stats via node-exporter [puppet] - https://gerrit.wikimedia.org/r/323559 (https://phabricator.wikimedia.org/T147429) [19:57:54] (CR) Filippo Giunchedi: [C: 2] prometheus: add vhtcpd stats via node-exporter [puppet] - https://gerrit.wikimedia.org/r/323559 (https://phabricator.wikimedia.org/T147429) (owner: Filippo Giunchedi) [20:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161130T2000). [20:01:47] PROBLEM - puppet last run on cp4001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-vhtcpd-stats] [20:02:13] heh that's me, fix incoming [20:02:26] I might be able to race puppet fails [20:02:30] (PS1) Filippo Giunchedi: prometheus: add missing 'd' to prometheus-vhtcp-stats [puppet] - https://gerrit.wikimedia.org/r/324545 [20:02:37] PROBLEM - puppet last run on cp1048 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-vhtcpd-stats] [20:02:53] (CR) Catrope: "FIXME: This should not have been deployed yet. Because https://gerrit.wikimedia.org/r/#/c/320328/ hasn't rolled out everywhere, the "softe" [mediawiki-config] - https://gerrit.wikimedia.org/r/324540 (https://phabricator.wikimedia.org/T150224) (owner: Ladsgroup) [20:03:23] (CR) Filippo Giunchedi: [C: 2 V: 2] prometheus: add missing 'd' to prometheus-vhtcp-stats [puppet] - https://gerrit.wikimedia.org/r/324545 (owner: Filippo Giunchedi) [20:04:47] RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [20:04:47] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-vhtcpd-stats] [20:04:48] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-vhtcpd-stats] [20:04:57] PROBLEM - puppet last run on cp2024 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-vhtcpd-stats] [20:05:24] Hallo. [20:05:47] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-vhtcpd-stats] [20:05:48] Is the train running today for non-Wikipedias + 2 Wikipedias? [20:05:57] PROBLEM - puppet last run on cp1047 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-vhtcpd-stats] [20:06:08] PROBLEM - puppet last run on cp2005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-vhtcpd-stats] [20:06:47] PROBLEM - puppet last run on cp1067 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-vhtcpd-stats] [20:06:47] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-vhtcpd-stats] [20:07:09] that's me, recovering at the next puppet run [20:07:22] !log mysql restart and general upgrade for es1017 T151995 [20:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:33] T151995: Rolling restart of external storage servers for TLS certificate update - https://phabricator.wikimedia.org/T151995 [20:09:01] aharoni, I think so [20:09:02] jouncebot, next [20:09:03] In 0 hour(s) and 50 minute(s): Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161130T2100) [20:09:06] hm [20:09:27] PROBLEM - puppet last run on db1052 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:09:45] aharoni, yeah, now, theoretically [20:10:37] jouncebot: now [20:10:37] For the next 1 hour(s) and 49 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161130T2000) [20:13:14] (CR) Chad: [C: 2] group1 to wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/324475 (owner: Chad) [20:13:49] (Merged) jenkins-bot: group1 to wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/324475 (owner: Chad) [20:13:57] all aboard [20:14:47] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 to wmf.4 [20:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:28] Operations, Phabricator: iridium / filesystem almost full - https://phabricator.wikimedia.org/T150396#2836208 (Dzahn) Adding the cron with a "find" to delete older files won't be hard, but _how_ old is old enough to delete?
[20:15:46] (CR) Ladsgroup: "If that patch gets deployed before this one that would cause lots of inconsistency with data (because the default value is actually less t" [mediawiki-config] - https://gerrit.wikimedia.org/r/324540 (https://phabricator.wikimedia.org/T150224) (owner: Ladsgroup) [20:21:27] Operations, Continuous-Integration-Config, Operations-Software-Development: Flake8 for python files without extension in puppet repo - https://phabricator.wikimedia.org/T144169#2590514 (fgiunchedi) After some discussion in https://gerrit.wikimedia.org/r/#/c/323559/ I've changed my vote to "automatica... [20:21:43] (PS1) Jcrespo: Revert "mariadb: depool db1017 for maintenance and general upgrade" [mediawiki-config] - https://gerrit.wikimedia.org/r/324548 [20:22:09] (CR) Jcrespo: [C: -2] "Wait for buffer pool warmup." [mediawiki-config] - https://gerrit.wikimedia.org/r/324548 (owner: Jcrespo) [20:23:38] jouncebot: I'm done with the train deploy. Gonna get some lunch, you want any? [20:24:07] PROBLEM - puppet last run on ocg1001 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [20:24:33] stoic jouncebot [20:29:37] RECOVERY - puppet last run on cp1048 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [20:32:47] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [20:32:47] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [20:32:57] RECOVERY - puppet last run on cp2024 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [20:32:57] RECOVERY - puppet last run on cp1047 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [20:33:12] (Draft1) Paladox: Phabricator: Allow us to change the default web domain in apache [puppet] - https://gerrit.wikimedia.org/r/324551 [20:33:14] (Draft2) Paladox: Phabricator: Allow us to change the default web domain in apache [puppet] - https://gerrit.wikimedia.org/r/324551 [20:33:36] Operations, Analytics: Install java 8 to stat1002 - https://phabricator.wikimedia.org/T151896#2836283 (EBernhardson) I am specifically doing some machine learning exploration using RankLib, a java implementation of various ML ranking algorithms (that could perhaps be integrated to an elasticsearch plugin...
[20:33:47] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [20:34:01] (PS11) Paladox: Phabricator: Allow us to change the default web domain [puppet] - https://gerrit.wikimedia.org/r/324408 [20:34:06] (PS12) Paladox: Phabricator: Allow us to change the default web domain [puppet] - https://gerrit.wikimedia.org/r/324408 [20:34:08] RECOVERY - puppet last run on cp2005 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [20:34:47] RECOVERY - puppet last run on cp1067 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [20:34:47] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [20:35:43] (CR) Dzahn: [C: -1] "please don't call it "base_uri" when it's not. what you are looking up and then setting is just $domain and $altdom. Later in the code the" [puppet] - https://gerrit.wikimedia.org/r/324408 (owner: Paladox) [20:36:03] (PS13) Paladox: Phabricator: Allow us to change the default web domain [puppet] - https://gerrit.wikimedia.org/r/324408 [20:37:05] (PS14) Paladox: Phabricator: Allow us to change the default web domain [puppet] - https://gerrit.wikimedia.org/r/324408 [20:38:23] (PS2) Filippo Giunchedi: logstash: switch to /srv partitioning for ingester hosts [puppet] - https://gerrit.wikimedia.org/r/324362 (https://phabricator.wikimedia.org/T150108) [20:38:27] RECOVERY - puppet last run on db1052 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [20:39:54] (CR) Filippo Giunchedi: [C: 2] logstash: switch to /srv partitioning for ingester hosts [puppet] - https://gerrit.wikimedia.org/r/324362 (https://phabricator.wikimedia.org/T150108) (owner: Filippo Giunchedi) [20:41:11] (CR) Dzahn: [C: -1] "- you are changing "git.wikimedia.org" to "git-ssh.wikimedia.org" it looks" [puppet] -
https://gerrit.wikimedia.org/r/324408 (owner: Paladox) [20:41:45] (CR) Paladox: "Woops sorry, fixing it now." [puppet] - https://gerrit.wikimedia.org/r/324408 (owner: Paladox) [20:41:49] (CR) Dzahn: "let's call it "altdomain" like before, instead of "security domain"" [puppet] - https://gerrit.wikimedia.org/r/324408 (owner: Paladox) [20:42:19] (PS15) Paladox: Phabricator: Allow us to change the default web domain [puppet] - https://gerrit.wikimedia.org/r/324408 [20:44:40] (PS16) Paladox: Phabricator: Allow us to change the default web domain [puppet] - https://gerrit.wikimedia.org/r/324408 [20:44:47] (PS17) Paladox: Phabricator: Allow us to change the default web domain [puppet] - https://gerrit.wikimedia.org/r/324408 [20:45:47] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:50:18] Operations, Traffic, WMF-Communications, HTTPS, Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2836315 (Florian) [20:51:07] RECOVERY - puppet last run on ocg1001 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [20:51:20] (PS2) Filippo Giunchedi: graphite: cleanup labs instances metrics [puppet] - https://gerrit.wikimedia.org/r/323339 (https://phabricator.wikimedia.org/T143405) [20:52:18] (CR) jenkins-bot: [V: -1] graphite: cleanup labs instances metrics [puppet] - https://gerrit.wikimedia.org/r/323339 (https://phabricator.wikimedia.org/T143405) (owner: Filippo Giunchedi) [20:52:56] !log milimetric@tin Starting deploy [analytics/refinery@9cd8845]: (no message) [20:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:59] (PS3) Filippo Giunchedi: graphite: cleanup labs instances metrics [puppet] - https://gerrit.wikimedia.org/r/323339 (https://phabricator.wikimedia.org/T143405)
[20:55:54] (CR) jenkins-bot: [V: -1] graphite: cleanup labs instances metrics [puppet] - https://gerrit.wikimedia.org/r/323339 (https://phabricator.wikimedia.org/T143405) (owner: Filippo Giunchedi) [20:55:59] !log milimetric@tin Finished deploy [analytics/refinery@9cd8845]: (no message) (duration: 03m 02s) [20:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:47] (PS18) Paladox: Phabricator: Allow us to change the default web domain [puppet] - https://gerrit.wikimedia.org/r/324408 [20:57:09] (PS4) Filippo Giunchedi: graphite: cleanup labs instances metrics [puppet] - https://gerrit.wikimedia.org/r/323339 (https://phabricator.wikimedia.org/T143405) [20:57:50] (PS19) Paladox: Phabricator: Allow us to change the default web domain [puppet] - https://gerrit.wikimedia.org/r/324408 [21:00:04] ostriches why is the cpu at 92% https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&c=Miscellaneous+eqiad&h=cobalt.wikimedia.org&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=NOGROUPS_%7C_network [21:00:04] gwicke, cscott, arlolra, subbu, bearND, mdholloway, halfak, Amir1, and yurik: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161130T2100). [21:00:05] ? [21:00:26] Nothing for ORES [21:00:29] no parsoid deploy today [21:01:27] (PS3) Andrew Bogott: Labs configuration for fi.wikivoyage.org [puppet] - https://gerrit.wikimedia.org/r/323698 (https://phabricator.wikimedia.org/T151570) (owner: MarcoAurelio) [21:02:43] paladox: l10nbot pushing to a billion repos, probably. [21:02:45] * ostriches shrugs [21:02:49] * ostriches goes back to lunch [21:02:52] Oh [21:03:01] ostriches but it keeps doing it every few hours?
[21:07:18] Oh well [21:07:32] Operations, Gerrit, Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 - https://phabricator.wikimedia.org/T148478#2836410 (Dzahn) today we disabled gc on gerrit completely https://gerrit.wikimedia.org/r/#/c/323655/ this was linked to T151676 a related ticket [21:07:35] (CR) Andrew Bogott: [C: 2] Labs configuration for fi.wikivoyage.org [puppet] - https://gerrit.wikimedia.org/r/323698 (https://phabricator.wikimedia.org/T151570) (owner: MarcoAurelio) [21:08:17] Operations, Gerrit, Beta-Cluster-reproducible, Patch-For-Review: gerrit jgit gc caused mediawiki/core repo problems - https://phabricator.wikimedia.org/T151676#2824332 (Dzahn) now gc is disabled. also see T148478 [21:09:14] Operations, Gerrit, Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 30/11/2016 - https://phabricator.wikimedia.org/T148478#2724179 (Dzahn) [21:09:18] (PS11) Andrew Bogott: base: Allow auto puppetmaster switching tuning [puppet] - https://gerrit.wikimedia.org/r/256890 (https://phabricator.wikimedia.org/T120159) (owner: Yuvipanda) [21:09:47] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:10:18] (CR) Andrew Bogott: [C: 2] base: Allow auto puppetmaster switching tuning [puppet] - https://gerrit.wikimedia.org/r/256890 (https://phabricator.wikimedia.org/T120159) (owner: Yuvipanda) [21:12:48] PROBLEM - puppet last run on es2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:12:57] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:12:57] PROBLEM - puppet last run on mw1290 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [21:13:07] PROBLEM - puppet last run on cp1072 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/puppet-run] [21:13:07] PROBLEM - puppet last run on dbstore1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:13:07] PROBLEM - puppet last run on mw1243 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/puppet-run] [21:13:08] PROBLEM - puppet last run on mw2116 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:13:17] PROBLEM - puppet last run on druid1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:13:17] PROBLEM - puppet last run on mw1283 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/puppet-run] [21:13:17] PROBLEM - puppet last run on mc1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:13:17] PROBLEM - puppet last run on mw2183 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:13:27] PROBLEM - puppet last run on db1088 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:13:27] PROBLEM - puppet last run on ms-be2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:13:27] PROBLEM - puppet last run on mw2246 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/puppet-run] [21:13:27] PROBLEM - puppet last run on mw2112 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:13:37] PROBLEM - puppet last run on rdb2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:13:37] PROBLEM - puppet last run on mw2160 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:13:37] PROBLEM - puppet last run on mw2243 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:13:37] PROBLEM - puppet last run on mw2199 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:13:47] PROBLEM - puppet last run on elastic2022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:13:47] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:13:47] PROBLEM - puppet last run on kubernetes1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:13:47] PROBLEM - puppet last run on elastic1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:13:47] PROBLEM - puppet last run on logstash1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:13:47] PROBLEM - puppet last run on mc1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:13:47] PROBLEM - puppet last run on eventlog1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:13:48] PROBLEM - puppet last run on restbase1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:13:48] PROBLEM - puppet last run on mw2170 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:13:49] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/puppet-run] [21:13:49] PROBLEM - puppet last run on db2046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:13:50] PROBLEM - puppet last run on mw2201 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:07] PROBLEM - puppet last run on mw2153 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:07] PROBLEM - puppet last run on elastic1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:07] PROBLEM - puppet last run on db1044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:07] PROBLEM - puppet last run on wtp1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:07] PROBLEM - puppet last run on mw1197 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:08] PROBLEM - puppet last run on mw1201 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:08] PROBLEM - puppet last run on mw1299 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:08] PROBLEM - puppet last run on db1064 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:08] PROBLEM - puppet last run on dbproxy1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:10] PROBLEM - puppet last run on cp2021 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:14:10] PROBLEM - puppet last run on db2069 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:10] PROBLEM - puppet last run on mw2139 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:17] PROBLEM - puppet last run on cp2025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:17] PROBLEM - puppet last run on elastic2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:17] PROBLEM - puppet last run on db2070 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:17] PROBLEM - puppet last run on mc2012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:27] PROBLEM - puppet last run on labsdb1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:27] PROBLEM - puppet last run on ganeti1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:27] PROBLEM - puppet last run on mw1271 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:27] PROBLEM - puppet last run on dbproxy1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:27] PROBLEM - puppet last run on db1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:27] PROBLEM - puppet last run on rdb1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:27] PROBLEM - puppet last run on ms-be2003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:14:37] PROBLEM - puppet last run on db1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:37] PROBLEM - puppet last run on pybal-test2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:37] PROBLEM - puppet last run on restbase2012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:47] PROBLEM - puppet last run on cp1049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:47] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:47] PROBLEM - puppet last run on elastic1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:47] PROBLEM - puppet last run on ms-fe2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:47] PROBLEM - puppet last run on mw1281 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:47] PROBLEM - puppet last run on elastic2017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:48] PROBLEM - puppet last run on mw2109 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:52] . [21:14:57] PROBLEM - puppet last run on analytics1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:57] PROBLEM - puppet last run on mc2009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:57] PROBLEM - puppet last run on mw2220 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:14:57] PROBLEM - puppet last run on mw2105 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:57] PROBLEM - puppet last run on restbase-test2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:07] PROBLEM - puppet last run on fluorine is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:07] PROBLEM - puppet last run on ms-be1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:08] PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:08] PROBLEM - puppet last run on mc1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:08] PROBLEM - puppet last run on mw2221 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:08] PROBLEM - puppet last run on ms-be2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:08] PROBLEM - puppet last run on mw2188 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:17] PROBLEM - puppet last run on db2039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:28] andrewbogott: change to base puppet? [21:15:28] PROBLEM - puppet last run on mw1254 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:28] PROBLEM - puppet last run on planet1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:28] PROBLEM - puppet last run on mw2240 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:15:28] PROBLEM - puppet last run on mw2083 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:37] PROBLEM - puppet last run on helium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:37] PROBLEM - puppet last run on mw1241 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:37] PROBLEM - puppet last run on lvs1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:37] PROBLEM - puppet last run on cp2023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:37] PROBLEM - puppet last run on mw2244 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:37] PROBLEM - puppet last run on mw2229 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:38] PROBLEM - puppet last run on wtp2020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:45] hm... [21:15:47] PROBLEM - puppet last run on analytics1048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:47] PROBLEM - puppet last run on relforge1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:47] PROBLEM - puppet last run on mw1205 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:47] PROBLEM - puppet last run on mw1263 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:47] PROBLEM - puppet last run on lvs1005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:15:47] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:47] mutante: let me look [21:15:47] PROBLEM - puppet last run on achernar is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:48] PROBLEM - puppet last run on ganeti2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:48] PROBLEM - puppet last run on mw2080 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:49] PROBLEM - puppet last run on mw1294 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:49] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:50] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:57] PROBLEM - puppet last run on db1066 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:57] PROBLEM - puppet last run on auth1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:57] PROBLEM - puppet last run on rdb2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:57] PROBLEM - puppet last run on db2045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:57] PROBLEM - puppet last run on mw2075 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:57] PROBLEM - puppet last run on wtp2005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:15:57] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:16:04] suggests to type /ignore icinga-wm [21:16:07] PROBLEM - puppet last run on mw2247 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:16:07] PROBLEM - puppet last run on mw2117 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:16:07] PROBLEM - puppet last run on elastic1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:16:07] PROBLEM - puppet last run on ms-fe1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:16:07] PROBLEM - puppet last run on db1050 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:16:07] PROBLEM - puppet last run on mw1189 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:16:07] PROBLEM - puppet last run on mw1173 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:16:17] PROBLEM - puppet last run on db1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:16:17] PROBLEM - puppet last run on db1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:16:17] PROBLEM - puppet last run on ms-be2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:16:17] PROBLEM - puppet last run on mw1300 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:16:17] PROBLEM - puppet last run on mw2163 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:16:17] PROBLEM - puppet last run on mw2114 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:16:27] PROBLEM - puppet last run on bast1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:16:27] PROBLEM - puppet last run on analytics1057 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:16:27] PROBLEM - puppet last run on mw1204 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:16:32] just for the moment [21:16:37] PROBLEM - puppet last run on copper is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:16:37] PROBLEM - puppet last run on mw1253 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:16:37] PROBLEM - puppet last run on ganeti1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:16:37] PROBLEM - puppet last run on analytics1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:16:37] PROBLEM - puppet last run on cp2016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:16:37] PROBLEM - puppet last run on ganeti2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:16:37] PROBLEM - puppet last run on mw2079 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:16:38] PROBLEM - puppet last run on maps2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:16:47] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:16:47] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:16:47] PROBLEM - puppet last run on elastic1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:16:48] PROBLEM - puppet last run on mw2212 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:16:48] PROBLEM - puppet last run on mw2143 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:16:48] PROBLEM - puppet last run on mw2082 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:16:48] PROBLEM - puppet last run on mw2134 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:17:00] !log temp. stopping ircecho [21:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:21] (03PS1) 10Andrew Bogott: Revert "base: Allow auto puppetmaster switching tuning" [puppet] - 10https://gerrit.wikimedia.org/r/324568 [21:18:12] sorry for the noise, I'm investigating [21:18:34] andrewbogott: thanks [21:18:36] there is this [21:18:38] Could not understand source #!/bin/bash [21:19:19] i'll have an eye on icinga and the bot [21:20:22] that's the problem. But... [21:20:27] why is that not allowed in a .erb file? 
[21:21:34] (03CR) 10Andrew Bogott: [C: 032] Revert "base: Allow auto puppetmaster switching tuning" [puppet] - 10https://gerrit.wikimedia.org/r/324568 (owner: 10Andrew Bogott) [21:22:58] I'm off to lunch, brb [21:23:09] puppet should be happy again for now [21:23:21] I'm still confused about the erb parsing situation, but I'll make a test setup [21:23:36] thanks [21:24:21] yea, looks good, i'll get the bot back when the number of crits is down [21:24:34] the one i was one works [21:24:43] 06Operations, 10Domains, 10Traffic, 06WMF-Legal: Register nlwikipedia.org to prevent squatting - https://phabricator.wikimedia.org/T128968#2836472 (10CRoslof) 05Invalid>03Resolved Update: The Wikimedia Foundation has acquired nlwikipedia.org [21:25:58] tickets with domains and legal on it and i see them resolved on IRC, love it [21:26:08] Yeah, great! :) [21:26:09] robh: [21:36:16] !log phab/iridium: deleting tmp files older than 2 weeks [21:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:10] andrewbogott: it should be content => for templates not source => [21:46:00] oh, maybe it's just that [21:46:03] thanks [21:46:11] jouncebot, refresh [21:46:13] jouncebot, next [21:46:14] (03PS1) 10Dzahn: phab: add cron to clean up old tmp files [puppet] - 10https://gerrit.wikimedia.org/r/324601 (https://phabricator.wikimedia.org/T150396) [21:46:14] I refreshed my knowledge about deployments. 
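Note on the tmp cleanup logged at 21:36 and the cron patch proposed at 21:46 above: the log does not show the command the patch uses, so the following is only a minimal sketch of the general idea, selecting regular files whose modification time is older than a cutoff. The function name `stale_files`, its parameters, and the 14-day default (taken from the "older than 2 weeks" log entry) are assumptions for illustration, not the contents of the Gerrit change. It is roughly the kind of listing a `find` invocation with `-type f` and an `-mtime` test would produce.

```python
import os
import stat
import time
from pathlib import Path


def stale_files(root, max_age_days=14, now=None):
    """Return regular files under root last modified more than max_age_days ago.

    Hypothetical helper for illustration; it only lists files, it deletes nothing.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds in a day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                st = path.lstat()  # lstat: do not follow symlinks
            except FileNotFoundError:
                continue  # file vanished between listing and stat
            if stat.S_ISREG(st.st_mode) and st.st_mtime < cutoff:
                stale.append(path)
    return sorted(stale)
```

Listing first and deleting in a separate step keeps a dry run cheap, which seems prudent given the later review comment that the cleanup is "trickier than it looks".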
[21:46:14] In 0 hour(s) and 13 minute(s): Finnish Wikivoyage wiki creation (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161130T2200) [21:46:58] (03CR) 10jenkins-bot: [V: 04-1] phab: add cron to clean up old tmp files [puppet] - 10https://gerrit.wikimedia.org/r/324601 (https://phabricator.wikimedia.org/T150396) (owner: 10Dzahn) [21:48:04] (03PS2) 10Dzahn: phab: add cron to clean up old tmp files [puppet] - 10https://gerrit.wikimedia.org/r/324601 (https://phabricator.wikimedia.org/T150396) [21:48:36] Krenair: Remember you'll need to do a parsoid service patch too. [21:48:58] that happens afterwards [21:48:59] (03CR) 10jenkins-bot: [V: 04-1] phab: add cron to clean up old tmp files [puppet] - 10https://gerrit.wikimedia.org/r/324601 (https://phabricator.wikimedia.org/T150396) (owner: 10Dzahn) [21:49:00] (Which can't be done until the new wiki is live, due to the way it pulls from the API or whatever.) [21:49:00] Yeh. [21:49:04] (03CR) 10Dzahn: "i feel like find is known working and easy enough (vs using tmpreaper)" [puppet] - 10https://gerrit.wikimedia.org/r/324601 (https://phabricator.wikimedia.org/T150396) (owner: 10Dzahn) [21:49:12] But ideally as part of the same deploy window. :-) [21:49:22] andrewbogott: yeah I think it is that judging from the error [21:50:20] on the change itself I always cringe a little when we're mixing programming and erb templates :( [21:50:57] James_F, ideally we'd actually be able to completely create a wiki [21:51:17] Krenair: Create wikis simply? You must be new here... [21:51:18] Krenair: Well yeah, the parsoid config issue is not great. [21:51:31] but since no one in ops knows all the steps and no one outside of ops has permissions for all the steps... 
[21:52:02] (03CR) 10Krinkle: phab: add cron to clean up old tmp files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/324601 (https://phabricator.wikimedia.org/T150396) (owner: 10Dzahn) [21:52:04] (03PS3) 10Dzahn: phab: add cron to clean up old tmp files [puppet] - 10https://gerrit.wikimedia.org/r/324601 (https://phabricator.wikimedia.org/T150396) [21:52:15] we end up relying on multiple people to get it done [21:53:57] to create a new wiki you need to start with "labs db replica"/tell dba people [21:54:00] then DNS [21:54:06] then mw [21:54:37] sometimes you need to change apache [21:54:44] sometimes you don't need to change DNS [21:54:54] ack [21:55:16] and then a bunch of changes in other tools and projects that hold lists of all wikis [21:55:21] do you know exactly what the DBA will do? [21:55:27] i dont [21:55:39] but it's about having to prepare the replicas [21:55:47] before the wiki exists [21:55:53] or stuff is worse [21:56:22] or, if the wiki is private, prevent replications [21:56:25] so it's supposed to be step 1 of the workflow [21:56:25] replication* [21:57:13] there is also stuff like restbase config / services [21:57:29] you really got multi teams involved sometimes [21:57:53] the good thing is, our wikitech page that has docs is way better than it used to be [21:58:14] improvements from the last couple wiki creations for sure [21:58:23] so it's getting faster i think [21:59:32] the last wiki creation was pretty recent, so I am cautiously optimistic that addWiki.php might not die horribly in the middle of working [22:00:04] Krenair: Dear anthropoid, the time has come. Please deploy Finnish Wikivoyage wiki creation (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161130T2200). [22:02:37] (03PS4) 10Dzahn: phab: add cron to clean up old tmp files [puppet] - 10https://gerrit.wikimedia.org/r/324601 (https://phabricator.wikimedia.org/T150396) [22:03:20] Dereckson, ! [22:03:28] it didn't die horribly! 
[22:03:30] (03CR) 10Dzahn: phab: add cron to clean up old tmp files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/324601 (https://phabricator.wikimedia.org/T150396) (owner: 10Dzahn) [22:03:43] !log Ran mwscript extensions/WikimediaMaintenance/addWiki.php --wiki=aawiki fi wikivoyage fiwikivoyage fi.wikivoyage.org [22:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:08] is it me or is stashbot slower than morebots? [22:04:42] grumble grumble... merge conflict due to wikiversions.json [22:06:55] !log bsitzmann@tin Starting deploy [mobileapps/deploy@d004bb4]: mobileapps deployment: 'Update service-mobileapp-node to 14deac7' [22:07:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:25] (03PS1) 10Andrew Bogott: base: Allow auto puppetmaster switching tuning [puppet] - 10https://gerrit.wikimedia.org/r/324610 (https://phabricator.wikimedia.org/T120159) [22:07:43] (03PS2) 10Alex Monk: Initial configuration for fi.wikivoyage.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323695 (https://phabricator.wikimedia.org/T151570) (owner: 10MarcoAurelio) [22:07:44] Krenair: yeah [22:07:52] !log re-enabling puppet on einsteinium, starting ircecho [22:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:04] !log bsitzmann@tin Finished deploy [mobileapps/deploy@d004bb4]: mobileapps deployment: 'Update service-mobileapp-node to 14deac7' (duration: 01m 09s) [22:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:23] (03CR) 10Alex Monk: [C: 032] Initial configuration for fi.wikivoyage.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323695 (https://phabricator.wikimedia.org/T151570) (owner: 10MarcoAurelio) [22:08:28] (03CR) 10jenkins-bot: [V: 04-1] base: Allow auto puppetmaster switching tuning [puppet] - 10https://gerrit.wikimedia.org/r/324610 (https://phabricator.wikimedia.org/T120159) (owner: 10Andrew Bogott) 
[22:08:43] We've a blessed end of 2016, with the ability to create wikis without script issues. Stay tuned for 2017 breakages. [22:08:57] RECOVERY - Debian mirror in sync with upstream on sodium is OK: /srv/mirrors/debian is over 0 hours old. [22:09:04] (03PS2) 10Andrew Bogott: base: Allow auto puppetmaster switching tuning [puppet] - 10https://gerrit.wikimedia.org/r/324610 (https://phabricator.wikimedia.org/T120159) [22:09:05] you had one a few weeks back though right? [22:09:11] yes [22:09:34] (03Merged) 10jenkins-bot: Initial configuration for fi.wikivoyage.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323695 (https://phabricator.wikimedia.org/T151570) (owner: 10MarcoAurelio) [22:11:08] !log krenair@tin Synchronized dblists: https://gerrit.wikimedia.org/r/#/c/323695/ (duration: 00m 49s) [22:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:40] !log krenair@tin rebuilt wikiversions.php and synchronized wikiversions files: https://gerrit.wikimedia.org/r/#/c/323695/ [22:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:05] (03CR) 10Andrew Bogott: [C: 032] base: Allow auto puppetmaster switching tuning [puppet] - 10https://gerrit.wikimedia.org/r/324610 (https://phabricator.wikimedia.org/T120159) (owner: 10Andrew Bogott) [22:12:56] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/323695/ (duration: 00m 49s) [22:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:20] !log Ran mwscript extensions/WikimediaMaintenance/filebackend/setZoneAccess.php fiwikivoyage --backend=local-multiwrite [22:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:09] One of the proposals on the community wishlist is a regular backup of all the files on Commons. [22:14:18] pff, still not uid 1 [22:14:18] Is this not something that's already done, for example via dumps? 
[22:14:37] apergos: Maybe you know? I think you've worked on dumps, right? :-) [22:14:45] I'd ask Ariel [22:14:50] yeah [22:14:57] eh? [22:15:15] This was the discussion: https://meta.wikimedia.org/wiki/2016_Community_Wishlist_Survey/Categories/Commons#Backup_of_Commons_files [22:15:37] PROBLEM - puppet last run on wtp2020 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/puppet-run] [22:15:37] PROBLEM - puppet last run on restbase2012 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/puppet-run] [22:15:38] of all images? I see [22:15:47] PROBLEM - puppet last run on ms-fe2004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/puppet-run] [22:15:57] PROBLEM - puppet last run on mw2075 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/puppet-run] [22:16:10] apergos: Yeah. I wasn't sure if this was something that was already done or not. [22:16:27] no, there used to be a live rsync mirror before images were moved to swift [22:16:33] but that's still not "backups" [22:16:40] !log Ran the dumpInterwiki.php script but it just produced the existing data, so nothing to do there [22:16:49] apergos: I see. Thanks! [22:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:56] backups implies that we save copies around. that's a lot of storage. 
for us or for someone else [22:16:58] (the wikidata sites population is running and is very unhappy) [22:17:02] yw [22:19:07] 06Operations, 10Cassandra, 10RESTBase, 06Services (doing): RESTBase k-r-v as Cassandra anti-pattern (or: revision retention policies considered harmful) - https://phabricator.wikimedia.org/T144431#2836606 (10GWicke) See T94121#2710479 for a summary of my earlier investigation of the wide row issue. [22:19:52] James_F, https://gerrit.wikimedia.org/r/324614 [22:19:56] apergos: btw, seems you were right about salt-minion running twice, like it's an issue on trusty but not on jessie. also i can use debdeploy to restart services vs salt directly [22:19:58] gwicke, mobrovac_: hey [22:20:20] ah gtk [22:20:57] that gets me the debdeploy server groups [22:21:01] which are also salt grains [22:21:33] gwicke, mobrovac_: ready for https://gerrit.wikimedia.org/r/#/c/323696/ if ops are? [22:21:53] chasemp, able to run `maintain-views --databases fiwikivoyage --debug` on the labsdb hosts? [22:22:45] Krenair: I'm tied up atm but in a bit or if this drags then first thing the a.m. probably? [22:22:47] RECOVERY - puppet last run on ms-fe3001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [22:23:01] chasemp, okay. I'll leave a note so it shows up in your inbox [22:23:08] thanks [22:25:47] RECOVERY - cassandra-c CQL 10.192.48.70:9042 on restbase2012 is OK: TCP OK - 0.037 second response time on 10.192.48.70 port 9042 [22:28:21] Krenair: go for it [22:28:39] mutante, mind doing https://gerrit.wikimedia.org/r/#/c/323696/ ? 
[22:29:05] ah, yea, i can do that [22:29:31] (03PS4) 10Dzahn: RESTBase configuration for fi.wikivoyage.org [puppet] - 10https://gerrit.wikimedia.org/r/323696 (https://phabricator.wikimedia.org/T151570) (owner: 10MarcoAurelio) [22:29:37] (03CR) 10Dzahn: [C: 032] RESTBase configuration for fi.wikivoyage.org [puppet] - 10https://gerrit.wikimedia.org/r/323696 (https://phabricator.wikimedia.org/T151570) (owner: 10MarcoAurelio) [22:29:57] Krenair: Thanks! [22:30:27] (03CR) 10Dzahn: [V: 032] RESTBase configuration for fi.wikivoyage.org [puppet] - 10https://gerrit.wikimedia.org/r/323696 (https://phabricator.wikimedia.org/T151570) (owner: 10MarcoAurelio) [22:31:56] * Stryn waits for MF-Warburg to do the importing [22:32:19] (03PS1) 10ArielGlenn: make lock stale time for incremental dumps a lot shorter [puppet] - 10https://gerrit.wikimedia.org/r/324617 [22:33:04] duh, now i have 3 salt-minions on one server [22:33:45] I'm no salt expert but I'm pretty sure that's not supposed to happen [22:34:14] (03PS2) 10ArielGlenn: make lock stale time for incremental dumps a lot shorter [puppet] - 10https://gerrit.wikimedia.org/r/324617 [22:34:25] it's when you manually restart it, and then puppet wants to chime in and restart it another time .. and if it's trusty and depending how you start it [22:34:48] it behaves differently with upstart? [22:35:06] (03CR) 10ArielGlenn: [C: 032] make lock stale time for incremental dumps a lot shorter [puppet] - 10https://gerrit.wikimedia.org/r/324617 (owner: 10ArielGlenn) [22:36:00] (03PS2) 10Jcrespo: Revert "mariadb: depool db1017 for maintenance and general upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324548 [22:36:35] yea [22:38:43] mutante, did the RB change apply everywhere? 
[22:39:18] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: depool db1017 for maintenance and general upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324548 (owner: 10Jcrespo) [22:39:48] (03PS1) 10ArielGlenn: start adds/changes dumps cron job earlier [puppet] - 10https://gerrit.wikimedia.org/r/324618 [22:39:55] (03Merged) 10jenkins-bot: Revert "mariadb: depool db1017 for maintenance and general upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324548 (owner: 10Jcrespo) [22:41:21] Krenair: i did not force the puppet run [22:41:31] (03CR) 10ArielGlenn: [C: 032] start adds/changes dumps cron job earlier [puppet] - 10https://gerrit.wikimedia.org/r/324618 (owner: 10ArielGlenn) [22:41:46] gwicke, we need puppet to run across all RB servers then you need to reload stuff, right? [22:42:28] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool es1017 (duration: 00m 45s) [22:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:37] RECOVERY - puppet last run on wtp2020 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [22:43:38] RECOVERY - puppet last run on restbase2012 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [22:43:47] RECOVERY - puppet last run on ms-fe2004 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [22:43:57] RECOVERY - puppet last run on mw2075 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [22:47:19] (03PS1) 10ArielGlenn: tweak stale lock refresh interval for incr dumps [dumps] - 10https://gerrit.wikimedia.org/r/324619 [22:48:59] Krenair: mutante: 3 salt-minion processes is the default config [22:49:20] (03PS2) 10ArielGlenn: tweak stale lock refresh interval for incr dumps [dumps] - 10https://gerrit.wikimedia.org/r/324619 [22:49:53] (03PS5) 10Krinkle: phab: add cron to clean up old tmp files [puppet] - 10https://gerrit.wikimedia.org/r/324601 
(https://phabricator.wikimedia.org/T150396) (owner: 10Dzahn) [22:49:57] (03CR) 10ArielGlenn: [C: 032] tweak stale lock refresh interval for incr dumps [dumps] - 10https://gerrit.wikimedia.org/r/324619 (owner: 10ArielGlenn) [22:49:59] (03CR) 10Krinkle: "Oh, I see what you mean now :)" [puppet] - 10https://gerrit.wikimedia.org/r/324601 (https://phabricator.wikimedia.org/T150396) (owner: 10Dzahn) [22:52:59] (03PS20) 10Paladox: Phabricator: Allow us to change the default web domain [puppet] - 10https://gerrit.wikimedia.org/r/324408 [22:53:09] (03PS3) 10Paladox: Phabricator: Allow us to change the default web domain in apache [puppet] - 10https://gerrit.wikimedia.org/r/324551 [22:55:28] or not, I checked a server of mine, and saw 3 too, but https://github.com/saltstack/salt/issues/7733 and https://github.com/saltstack/salt/issues/12217 seems to indicate issues [22:55:43] !log ariel@tin Starting deploy [dumps/dumps@04a57c5]: (no message) [22:55:45] !log ariel@tin Finished deploy [dumps/dumps@04a57c5]: (no message) (duration: 00m 01s) [22:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:17] apergos: Whee !logging :D [22:56:29] indeed but I forgot to put a message in [22:56:43] and I was even happy to see that feature announced (saw it in mail today) [22:59:32] 06Operations, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2836718 (10greg) p:05Triage>03Normal [23:00:52] (03PS1) 10ArielGlenn: fix usage message [dumps] - 10https://gerrit.wikimedia.org/r/324622 [23:03:18] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 30/11/2016 - https://phabricator.wikimedia.org/T148478#2836728 (10Paladox) The cpu seems to be still very high 
https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&c=Miscellane... [23:05:06] gwicke, ping [23:05:43] (03CR) 10Paladox: [C: 031] "@Dzahn can you run puppet compiler please?" [puppet] - 10https://gerrit.wikimedia.org/r/324408 (owner: 10Paladox) [23:05:49] (03CR) 10Paladox: [C: 031] "@Dzahn can you run puppet compiler please?" [puppet] - 10https://gerrit.wikimedia.org/r/324551 (owner: 10Paladox) [23:08:56] Krenair: pong [23:09:43] gwicke, how are the restarts going? [23:09:56] (03PS1) 10BryanDavis: toollabs: remove host aliases for tools-exec-12[01-11] [puppet] - 10https://gerrit.wikimedia.org/r/324623 (https://phabricator.wikimedia.org/T151980) [23:10:02] I'm not restarting anything [23:10:25] typically ops or Marko would do that after merging the puppet change [23:11:00] So when I asked if you were ready for the patch [23:11:02] And you said yes [23:11:05] What you meant was no [23:17:18] (03PS1) 10Alex Monk: Revert "RESTBase configuration for fi.wikivoyage.org" [puppet] - 10https://gerrit.wikimedia.org/r/324624 [23:17:23] mutante: https://gerrit.wikimedia.org/r/324624 [23:19:40] (03PS2) 10ArielGlenn: fix usage message [dumps] - 10https://gerrit.wikimedia.org/r/324622 [23:21:37] Krenair: oh, I thought you were asking about whether there is any reason to not deploy this now [23:21:48] sorry if I was being unclear [23:22:18] a reason not to deploy this now would be that we're not ready to make production state reflect the filesystem/puppet [23:22:27] the process is documented at https://wikitech.wikimedia.org/wiki/RESTBase#Deploy_configuration_changes [23:22:30] Hey... we noticed that https://en.m.wikipedia.org/wiki/Portal:Space is linking to action=purge urls [23:22:34] im guessing it shouldnt do that? [23:22:48] (has a link Purge server cache) [23:22:53] (03PS3) 10ArielGlenn: fix usage message [dumps] - 10https://gerrit.wikimedia.org/r/324622 [23:22:57] no idea why. 
[23:23:54] jdlrobson, it'll be one of the templates in there, they're able to make links like that [23:24:12] (03CR) 10ArielGlenn: [C: 032] fix usage message [dumps] - 10https://gerrit.wikimedia.org/r/324622 (owner: 10ArielGlenn) [23:24:32] jdlrobson, it's https://en.wikipedia.org/wiki/Template:Purge_page [23:24:48] (03PS1) 10ArielGlenn: miscdumps: use log.info for verbose only [dumps] - 10https://gerrit.wikimedia.org/r/324625 [23:25:08] but should we be encouraging purging of pages by readers/crawlers? seems strange if you dont understand what that means [23:25:12] apparently a lot of portals use it: https://en.wikipedia.org/w/index.php?title=Special:WhatLinksHere/Template:Purge_page&limit=500 [23:25:42] probably not [23:26:29] (03CR) 10ArielGlenn: [C: 032] miscdumps: use log.info for verbose only [dumps] - 10https://gerrit.wikimedia.org/r/324625 (owner: 10ArielGlenn) [23:27:50] mm [23:27:54] thanks for the context Krenair [23:28:43] might be a good idea to find the revision which added that transclusion and find out why [23:37:03] (03CR) 10Filippo Giunchedi: [C: 04-1] "Trickier than it looks :(" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/324601 (https://phabricator.wikimedia.org/T150396) (owner: 10Dzahn) [23:44:09] hello Krenair, news from the recreation wiki (lol) ? [23:44:25] alchimista, what? [23:44:49] Krenair, https://phabricator.wikimedia.org/T126832 the wiki recreation, of ptwmp [23:45:03] what about it? [23:45:56] as far as i understand, there is no references, so the db can be "cleande" [23:46:24] *cleaned. How's the work going? [23:48:30] I'm not aware of anyone working on it [23:48:47] If someone was I would expect it to be clear on the task [23:49:53] Krenair: back, so we are reverting. 
[23:50:23] (03PS2) 10Dzahn: Revert "RESTBase configuration for fi.wikivoyage.org" [puppet] - 10https://gerrit.wikimedia.org/r/324624 (owner: 10Alex Monk) [23:50:59] (03CR) 10Dzahn: [C: 032] Revert "RESTBase configuration for fi.wikivoyage.org" [puppet] - 10https://gerrit.wikimedia.org/r/324624 (owner: 10Alex Monk) [23:51:20] 06Operations, 06Analytics-Kanban, 10hardware-requests: stat1001 replacement box in eqiad - https://phabricator.wikimedia.org/T149911#2836823 (10RobH) 05Open>03Resolved reinstalled, puppet and salt keys accepted. it has some puppet failures, but since those are service related, i'll leave them to you to... [23:57:09] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:57:41] (03PS21) 10Dzahn: Phabricator: Allow us to change the default web domain [puppet] - 10https://gerrit.wikimedia.org/r/324408 (owner: 10Paladox)