[00:29:58] (PS4) EBernhardson: Duplicate logstash output to alternate elasticsearch cluster [puppet] - https://gerrit.wikimedia.org/r/295442
[01:02:15] PROBLEM - puppet last run on labnet1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[01:21:56] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1307.94 seconds
[01:27:46] RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:51:58] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.07 seconds
[02:00:24] (PS1) Yurik: (WIP) Notify TileratorUI on new expiry files [puppet] - https://gerrit.wikimedia.org/r/295450 (https://phabricator.wikimedia.org/T108459)
[02:31:23] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.6) (duration: 10m 24s)
[02:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:05:49] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.7) (duration: 17m 49s)
[03:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:12:33] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Jun 22 03:12:33 UTC 2016 (duration 6m 44s)
[03:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:29:22] !log fix salt key on labtestmetal2001
[04:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:47:20] PROBLEM - puppet last run on cp2007 is CRITICAL: CRITICAL: Puppet has 1 failures
[05:13:13] RECOVERY - puppet last run on cp2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:29:08] (PS1) KartikMistry: Deploy Compact Language Links as default (Stage 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/295454 (https://phabricator.wikimedia.org/T136677)
[05:34:39] (PS1) Tim Landscheidt: labstore: Remove redundant calls to lower() for user names [puppet] - https://gerrit.wikimedia.org/r/295455
[05:35:43] (CR) Tim Landscheidt: "String arithmetics:" [puppet] - https://gerrit.wikimedia.org/r/295455 (owner: Tim Landscheidt)
[05:52:07] (PS2) KartikMistry: Deploy Compact Language Links as default (Stage 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/295454 (https://phabricator.wikimedia.org/T136677)
[06:04:48] PROBLEM - puppet last run on mw2132 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:36] RECOVERY - puppet last run on mw2132 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[06:30:56] PROBLEM - puppet last run on ms-be1010 is CRITICAL: CRITICAL: puppet fail
[06:31:27] PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:27] PROBLEM - puppet last run on mw1172 is CRITICAL: CRITICAL: puppet fail
[06:31:47] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:57] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:07] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:07] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail
[06:32:17] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:17] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:26] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:48] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:08] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:27] PROBLEM - puppet last run on mw2250 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:56] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:56:26] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:56:38] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[06:56:46] RECOVERY - puppet last run on ms-be1010 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[06:56:57] RECOVERY - puppet last run on mw2250 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:57:18] RECOVERY - puppet last run on lvs1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:57:18] RECOVERY - puppet last run on mw1172 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[06:57:27] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[06:57:37] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:46] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:48] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:57] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[06:58:06] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:06] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:07] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:01:07] PROBLEM - puppet last run on mw1260 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:05:17] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]
[07:06:23] !log restarted hhvm on mw1131
[07:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:06:46] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]
[07:08:27] RECOVERY - Apache HTTP on mw1131 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.237 second response time
[07:08:36] RECOVERY - HHVM rendering on mw1131 is OK: HTTP OK: HTTP/1.1 200 OK - 67924 bytes in 0.391 second response time
[07:24:58] ACKNOWLEDGEMENT - Elasticsearch HTTPS on elastic1032 is CRITICAL: Use of uninitialized value sans in concatenation (.) or string at /usr/lib/nagios/plugins/check_ssl line 185. Muehlenhoff Host is in setup, see SAL
[07:26:47] RECOVERY - puppet last run on mw1260 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:26:58] PROBLEM - Disk space on fluorine is CRITICAL: DISK CRITICAL - free space: /a 136540 MB (3% inode=99%)
[07:30:49] !log stopping, backing up and reimaging db1061 and db1062
[07:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:37:54] Operations, Traffic, Wiki-Loves-Monuments, HTTPS: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2398200 (SindyM3) Done :D
[07:51:04] I am waiting for logrotate to compress the 147G file api.log-20160622 on fluorine
[07:56:17] PROBLEM - puppet last run on mw1140 is CRITICAL: CRITICAL: Puppet has 41 failures
[07:59:01] !log rolling restart of hhvm/apache on canary app servers in eqiad for expat security update
[07:59:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:08:27] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1021, Errmsg: Error Disk full (module_deps): waiting for someone to free some space... (errno: 189 Disk full) on query. Default database: bgwiki. Query: REPLACE /* DatabaseMysqlBase::replace */ INTO module_deps (md_module,md_skin,md_deps) VALUES (ext.wikimediaBadges,vector
[08:08:50] grr
[08:08:53] <_joe_> lol
[08:09:12] <_joe_> I was reading the alert and thinking: jynus commenting in 3,2,...
[08:09:17] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1021, Errmsg: Error Disk full (module_deps): waiting for someone to free some space... (errno: 189 Disk full) on query. Default database: commonswiki. Query: REPLACE /* DatabaseMysqlBase::replace */ INTO module_deps (md_module,md_skin,md_deps) VALUES (ext.wikimediaBadges,vector
[08:09:27] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1021, Errmsg: Error Disk full (pagelinks): waiting for someone to free some space... (errno: 189 Disk full) on query. Default database: enwiki. Query: INSERT /* LinksUpdate::incrTableUpdate 127.0.0.1 */ IGNORE INTO pagelinks (pl_from,pl_from_namespace,pl_namespace,pl_title) VALUES (50822974,0,10,R_from_ambiguous_page),(50
[08:09:36] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1021, Errmsg: Error Disk full (text): waiting for someone to free some space... (errno: 189 Disk full) on query. Default database: huwiki. Query: INSERT /* Revision::insertOn Szilas */ INTO text (old_id,old_text,old_flags) VALUES (NULL,DB://cluster25/2692957,utf-8,gzip,external)
[08:09:37] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1021, Errmsg: Error Disk full (text): waiting for someone to free some space... (errno: 189 Disk full) on query. Default database: wikidatawiki. Query: INSERT /* Revision::insertOn ShinePhantom */ INTO text (old_id,old_text,old_flags) VALUES (NULL,DB://cluster24/176847594,utf-8,gzip,external)
[08:09:38] but why, there is 300GB left!
[08:09:40] <_joe_> I can't really help over this network, sorry
[08:09:57] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1021, Errmsg: Error Disk full (querycache_info): waiting for someone to free some space... (errno: 189 Disk full) on query. Default database: ruwiki. Query: REPLACE /* RecentChangesUpdateJob::{closure} */ INTO querycache_info (qci_type,qci_timestamp) VALUES (activeusers,20160621080106)
[08:10:33] disk full? with 300GB left?
[08:11:10] I mean, I have 7TB used
[08:12:45] I think it is a TokuDB-only thing
[08:13:13] mysql should send an io error otherwise
[08:13:38] Uh
[08:13:41] tokudb_fs_reserve_percent ?
[08:13:49] ah it's 5%
[08:13:54] jynus: spot on
[08:14:16] another reason to hate toku
[08:14:23] hehe
[08:15:14] and of course the variable is not hot
[08:15:20] so it will have to wait
[08:16:03] it is delayed 24 hours, no problem if it gets delayed 25h
[08:16:37] RECOVERY - puppet last run on mw1140 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:16:37] and I only added 163GB extra
[08:22:04] nice, and now I cannot log in to the management interface of db1061
[08:27:27] RECOVERY - Disk space on fluorine is OK: DISK OK
[08:33:06] Operations, ops-eqiad: db1061 management interface needs physical reset - https://phabricator.wikimedia.org/T138368#2398311 (jcrespo)
[08:33:44] Operations, ops-eqiad: db1061 management interface needs physical reset - https://phabricator.wikimedia.org/T138368#2398324 (jcrespo) This is blocking a time-sensitive reimage.
[08:36:00] Operations, ops-eqiad: db1061 and db1062 management interfaces need physical reset - https://phabricator.wikimedia.org/T138368#2398326 (jcrespo)
[08:36:44] akosiaris: thanks for the real_networks etherpad! I'll take a look
[08:41:30] godog: be warned. it's just my thoughts as I tried to capture them in a pad yesterday. I may very well be wrong on some things.
[08:41:59] RECOVERY - Elasticsearch HTTPS on elastic1032 is OK: SSL OK - Certificate elastic1032.eqiad.wmnet valid until 2021-06-21 08:40:25 +0000 (expires in 1824 days)
[08:45:18] PROBLEM - HHVM rendering on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:45:55] heheh ok!
[08:46:07] PROBLEM - Apache HTTP on mw1140 is CRITICAL: HTTP CRITICAL - No data received from host
[08:46:43] (CR) Alexandros Kosiaris: [C: -1] (WIP) Notify TileratorUI on new expiry files (1 comment) [puppet] - https://gerrit.wikimedia.org/r/295450 (https://phabricator.wikimedia.org/T108459) (owner: Yurik)
[08:47:23] (CR) Alexandros Kosiaris: [C: 1] openldap: enable the memberof overlay [puppet] - https://gerrit.wikimedia.org/r/295357 (owner: Faidon Liambotis)
[08:47:28] PROBLEM - dhclient process on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:47:48] PROBLEM - SSH on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:47:59] (CR) Alexandros Kosiaris: [C: 1] svc: add graphite LVS addresses [dns] - https://gerrit.wikimedia.org/r/289635 (https://phabricator.wikimedia.org/T85451) (owner: Filippo Giunchedi)
[08:48:08] PROBLEM - nutcracker port on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:48:28] PROBLEM - nutcracker process on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:48:47] PROBLEM - puppet last run on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:48:57] PROBLEM - HHVM processes on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:48:58] PROBLEM - Check size of conntrack table on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:48:58] PROBLEM - configured eth on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
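An aside on the "disk full with 300GB left" puzzle discussed above (08:10-08:16): TokuDB stops accepting writes once free space falls below tokudb_fs_reserve_percent of the filesystem, 5% by default. A minimal Python sketch using the approximate numbers jynus mentions (7TB used, 300GB free) shows why the reserve threshold was already crossed:

```python
# Approximate figures from the conversation above, not exact measurements.
used_gb = 7000   # "I mean, I have 7TB used"
free_gb = 300    # "but why, there is 300GB left!"
total_gb = used_gb + free_gb

reserve_pct = 5  # tokudb_fs_reserve_percent default, as noted at 08:13:49
reserve_gb = total_gb * reserve_pct / 100

print(f"reserve threshold: {reserve_gb:.0f} GB, actually free: {free_gb} GB")
# TokuDB refuses writes once free space drops below the reserve,
# which is why MariaDB reported "Disk full" with 300 GB still free:
print("writes blocked:", free_gb < reserve_gb)  # True: 300 < 365
```

This also explains the follow-up patch lowering the reserve to 1% and the complaint that "the variable is not hot": it cannot be changed on a running server, so the fix had to wait for the mysqld restart logged at 09:19.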
[08:48:58] PROBLEM - salt-minion processes on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:49:09] PROBLEM - DPKG on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:49:28] PROBLEM - Disk space on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:50:18] RECOVERY - nutcracker port on mw1140 is OK: TCP OK - 0.000 second response time on port 11212
[08:51:29] (PS1) Jcrespo: Lower tokudb_fs_reserve_percent to 1% [puppet] - https://gerrit.wikimedia.org/r/295457
[08:53:30] (PS2) Jcrespo: Lower tokudb_fs_reserve_percent to 1% [puppet] - https://gerrit.wikimedia.org/r/295457
[08:59:19] PROBLEM - nutcracker port on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:02:09] Operations: LDAP Account required for Transparency Report - https://phabricator.wikimedia.org/T138369#2398350 (siddharth11)
[09:09:35] (CR) Jcrespo: [C: 2] Lower tokudb_fs_reserve_percent to 1% [puppet] - https://gerrit.wikimedia.org/r/295457 (owner: Jcrespo)
[09:12:28] RECOVERY - configured eth on mw1140 is OK: OK - interfaces up
[09:12:29] RECOVERY - SSH on mw1140 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0)
[09:13:09] RECOVERY - dhclient process on mw1140 is OK: PROCS OK: 0 processes with command name dhclient
[09:13:28] RECOVERY - nutcracker port on mw1140 is OK: TCP OK - 0.000 second response time on port 11212
[09:13:29] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.038 second response time
[09:13:29] RECOVERY - nutcracker process on mw1140 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[09:13:39] RECOVERY - DPKG on mw1140 is OK: All packages OK
[09:13:49] RECOVERY - salt-minion processes on mw1140 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[09:13:58] RECOVERY - Disk space on mw1140 is OK: DISK OK
[09:14:08] RECOVERY - puppet last run on mw1140 is OK: OK: Puppet is currently enabled, last run 59 minutes ago with 0 failures
[09:14:29] RECOVERY - HHVM processes on mw1140 is OK: PROCS OK: 6 processes with command name hhvm
[09:14:30] RECOVERY - Check size of conntrack table on mw1140 is OK: OK: nf_conntrack is 18 % full
[09:14:38] RECOVERY - HHVM rendering on mw1140 is OK: HTTP OK: HTTP/1.1 200 OK - 67910 bytes in 0.118 second response time
[09:15:44] (PS3) Ema: tlsproxy: enable client/server TFO support in the kernel [puppet] - https://gerrit.wikimedia.org/r/295331 (https://phabricator.wikimedia.org/T108827)
[09:17:00] (CR) Ema: [C: 2 V: 2] tlsproxy: enable client/server TFO support in the kernel [puppet] - https://gerrit.wikimedia.org/r/295331 (https://phabricator.wikimedia.org/T108827) (owner: Ema)
[09:17:44] jynus: looks like there is an unmerged change of yours on palladium (tokudb_fs_reserve_percent)
[09:17:53] can I merge it?
[09:18:08] I was doing the same
[09:18:17] please do
[09:18:26] jynus: done :)
[09:19:48] !log stopping and reconfiguring mysql on dbstore1001
[09:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:20:58] PROBLEM - puppet last run on mw1140 is CRITICAL: CRITICAL: Puppet has 75 failures
[09:21:57] Operations: LDAP Account required for Transparency Report - https://phabricator.wikimedia.org/T138369#2398350 (hashar) LDAP accounts are created via https://wikitech.wikimedia.org/ and you seem to already have one: https://wikitech.wikimedia.org/wiki/User:Siddparmar Your account is neither a member of LDA...
[09:23:06] (CR) Muehlenhoff: [C: 1] "Looks good. I tested what's needed to add memberOf attributes for existing group entries: The memberOf attributes on the user accounts are" [puppet] - https://gerrit.wikimedia.org/r/295357 (owner: Faidon Liambotis)
[09:28:49] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[09:28:49] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[09:28:50] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[09:36:51] Operations, Graphite: Grafana login issue for @thiemowmde - https://phabricator.wikimedia.org/T135994#2398393 (thiemowmde) Open>Resolved a:thiemowmde I tried again with Chromium and can login, but can't with Firefox. So obviously something Firefox does different (encoding, obviously). Not to...
[09:43:04] !log live-hacking on mw1017 to debug T115119
[09:43:05] T115119: Implement Schema:ExternalLinkChange - https://phabricator.wikimedia.org/T115119
[09:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:44:59] PROBLEM - HHVM rendering on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:46:18] !log rolling restart of restbase in codfw to pick up firejail change in service::node
[09:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:47:08] RECOVERY - HHVM rendering on mw1140 is OK: HTTP OK: HTTP/1.1 200 OK - 67903 bytes in 0.819 second response time
[10:07:42] (Abandoned) Mobrovac: service::node: Output stdout and stderr seen by firejail to a log file [puppet] - https://gerrit.wikimedia.org/r/294499 (owner: Mobrovac)
[10:16:09] PROBLEM - Apache HTTP on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:17:00] Operations, ops-codfw, media-storage, Patch-For-Review: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T136630#2398559 (fgiunchedi) switch port configuration wasn't correct (`ge` vs `xe` port names), I've fixed that and was able to pxe-boot ms-be2022
[10:17:18] PROBLEM - HHVM rendering on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:17:38] PROBLEM - SSH on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:18:55] !log rolling restart of restbase in eqiad to pick up firejail change in service::node
[10:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:19:29] PROBLEM - HHVM processes on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:19:29] PROBLEM - Check size of conntrack table on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:19:39] PROBLEM - configured eth on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:23:18] PROBLEM - DPKG on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:23:20] PROBLEM - salt-minion processes on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:23:29] PROBLEM - Disk space on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:23:50] RECOVERY - HHVM processes on mw1140 is OK: PROCS OK: 6 processes with command name hhvm
[10:25:50] RECOVERY - Disk space on mw1140 is OK: DISK OK
[10:27:37] again?
[10:29:18] Operations, DBA, Labs, Labs-Infrastructure: Migrate labsdb1005/1006/1007 to jessie - https://phabricator.wikimedia.org/T123731#1936600 (jcrespo) There is indeed a replacement for labsdb100[123] about to arrive. However, there are no short-term plans for these, as they have lower impact. labsdb10...
[10:29:21] lunch &
[10:29:48] PROBLEM - dhclient process on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:30:00] going to check mw1140, memory pressure seems to be the cause
[10:31:19] PROBLEM - HHVM processes on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:32:06] !log mw1140 powercycle after freeze issues due to memory pressure (was not able to ssh to it)
[10:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:32:20] PROBLEM - Disk space on mw1140 is CRITICAL: Timeout while attempting connection
[10:33:01] PROBLEM - nutcracker port on mw1140 is CRITICAL: Timeout while attempting connection
[10:33:11] PROBLEM - nutcracker process on mw1140 is CRITICAL: Timeout while attempting connection
[10:33:51] RECOVERY - Check size of conntrack table on mw1140 is OK: OK: nf_conntrack is 0 % full
[10:33:52] RECOVERY - HHVM processes on mw1140 is OK: PROCS OK: 2 processes with command name hhvm
[10:34:01] RECOVERY - SSH on mw1140 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0)
[10:34:21] RECOVERY - configured eth on mw1140 is OK: OK - interfaces up
[10:34:30] RECOVERY - dhclient process on mw1140 is OK: PROCS OK: 0 processes with command name dhclient
[10:34:50] RECOVERY - nutcracker port on mw1140 is OK: TCP OK - 0.000 second response time on port 11212
[10:35:01] RECOVERY - nutcracker process on mw1140 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[10:35:11] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 1.355 second response time
[10:35:30] RECOVERY - DPKG on mw1140 is OK: All packages OK
[10:35:50] RECOVERY - salt-minion processes on mw1140 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:35:51] RECOVERY - Disk space on mw1140 is OK: DISK OK
[10:36:22] RECOVERY - HHVM rendering on mw1140 is OK: HTTP OK: HTTP/1.1 200 OK - 67903 bytes in 0.595 second response time
[10:37:40] RECOVERY - puppet last run on mw1140 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[10:46:37] !log upload libphutil/arcanist 0~git20160620-0wmf1 to carbon
[10:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:53:39] Operations, Collaboration-Team-Interested, DBA, Flow, WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610#2398667 (jcrespo) I have not forgotten about this, it is on 'Next', blocked on me having proper time (there is n...
[10:59:27] Operations, LDAP-Access-Requests: LDAP Account required for Transparency Report - https://phabricator.wikimedia.org/T138369#2398687 (Peachey88)
[11:02:13] Operations, Commons, media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#2398690 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[11:15:07] going afk for lunch, mw128[789] and mw1290 are new api appservers. I can't directly silence them until they show up in icinga, so if you see some spam please be patient :)
[11:15:09] (PS1) Filippo Giunchedi: install_server: add prometheus2001 [puppet] - https://gerrit.wikimedia.org/r/295470 (https://phabricator.wikimedia.org/T136313)
[11:16:58] Operations: Frequent segfaults of rsvg-convert on image scalers - https://phabricator.wikimedia.org/T137876#2398698 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[11:20:18] Operations, Commons, Wikimedia-SVG-rendering: SVG files larger than 10 MB cannot be thumbnailed - https://phabricator.wikimedia.org/T111815#2398704 (MoritzMuehlenhoff)
[11:20:34] Operations, Commons, Wikimedia-SVG-rendering: SVG files larger than 10 MB cannot be thumbnailed - https://phabricator.wikimedia.org/T111815#1616960 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[11:31:05] (PS1) Filippo Giunchedi: swift: add ms-be202[2-7] [puppet] - https://gerrit.wikimedia.org/r/295472 (https://phabricator.wikimedia.org/T136630)
[11:31:34] (CR) Filippo Giunchedi: [C: 2 V: 2] install_server: add prometheus2001 [puppet] - https://gerrit.wikimedia.org/r/295470 (https://phabricator.wikimedia.org/T136313) (owner: Filippo Giunchedi)
[11:31:49] (CR) Filippo Giunchedi: [C: 2 V: 2] swift: add ms-be202[2-7] [puppet] - https://gerrit.wikimedia.org/r/295472 (https://phabricator.wikimedia.org/T136630) (owner: Filippo Giunchedi)
[11:31:56] (CR) Gehel: "Puppet compiler looks good https://puppet-compiler.wmflabs.org/3159/" [puppet] - https://gerrit.wikimedia.org/r/295369 (https://phabricator.wikimedia.org/T138329) (owner: Gehel)
[11:34:51] (PS1) Gehel: Configuring new elastic1033-1037 servers [puppet] - https://gerrit.wikimedia.org/r/295473 (https://phabricator.wikimedia.org/T138329)
[11:35:37] (CR) Gehel: [C: -1] "Minor fix to HTTPS required before merging this. Fix coming up right now." [puppet] - https://gerrit.wikimedia.org/r/295473 (https://phabricator.wikimedia.org/T138329) (owner: Gehel)
[11:36:57] (PS2) Gehel: Adding missing dependency in exposing puppet SSL certs on elasticsearch [puppet] - https://gerrit.wikimedia.org/r/295369 (https://phabricator.wikimedia.org/T138329)
[11:38:31] (CR) Gehel: [C: 2] Adding missing dependency in exposing puppet SSL certs on elasticsearch [puppet] - https://gerrit.wikimedia.org/r/295369 (https://phabricator.wikimedia.org/T138329) (owner: Gehel)
[11:38:52] PROBLEM - puppet last run on ms-be2022 is CRITICAL: CRITICAL: Puppet has 12 failures
[11:41:31] Operations, Wikimedia-SVG-rendering, Upstream: Filter effect Gaussian blur filter not rendered correctly for small to medium thumbnail sizes - https://phabricator.wikimedia.org/T44090#461916 (MoritzMuehlenhoff) That bug is fixed on the new jessie image scaler using 2.4.16 (tested locally, it's not ye...
[11:41:49] Operations, Wikimedia-SVG-rendering, Upstream: Filter effect Gaussian blur filter not rendered correctly for small to medium thumbnail sizes - https://phabricator.wikimedia.org/T44090#2398736 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[11:51:23] (CR) BBlack: [C: 1] lvs: rate-limit more ICMP codes, lower to 1/200ms [puppet] - https://gerrit.wikimedia.org/r/294467 (https://phabricator.wikimedia.org/T136939) (owner: Faidon Liambotis)
[11:54:32] RECOVERY - puppet last run on ms-be2022 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[11:59:10] (PS2) Gehel: Configuring new elastic1033-1037 servers [puppet] - https://gerrit.wikimedia.org/r/295473 (https://phabricator.wikimedia.org/T138329)
[12:00:47] (CR) Gehel: [C: 2] Configuring new elastic1033-1037 servers [puppet] - https://gerrit.wikimedia.org/r/295473 (https://phabricator.wikimedia.org/T138329) (owner: Gehel)
[12:06:27] !log configuring new elasticsearch servers elastic1033-1037 in eqiad
[12:06:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:14:24] Operations, Traffic, Wiki-Loves-Monuments, HTTPS: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2398786 (JeanFred) Open>Resolved Thanks all for this! :)
[12:23:41] (PS6) Yuvipanda: ores: fix workers and config [puppet] - https://gerrit.wikimedia.org/r/293904 (owner: Ladsgroup)
[12:23:47] (CR) Yuvipanda: [C: 2 V: 2] ores: fix workers and config [puppet] - https://gerrit.wikimedia.org/r/293904 (owner: Ladsgroup)
[12:27:16] (PS1) Alexandros Kosiaris: ldaplist: Allow searching for more than one attribute [puppet] - https://gerrit.wikimedia.org/r/295475
[12:30:25] (PS2) Alexandros Kosiaris: ldaplist: Allow searching for more than one attribute [puppet] - https://gerrit.wikimedia.org/r/295475
[12:32:01] (PS1) Hashar: contint: create /var/lib/jenkins/builds [puppet] - https://gerrit.wikimedia.org/r/295477 (https://phabricator.wikimedia.org/T80385)
[12:33:32] (CR) Alexandros Kosiaris: "@Tim, will this https://gerrit.wikimedia.org/r/295475 serve your needs?" [puppet] - https://gerrit.wikimedia.org/r/295198 (https://phabricator.wikimedia.org/T122595) (owner: Muehlenhoff)
[12:34:09] !log T80385 stopping Jenkins and migrating all build records to /var/lib/jenkins/builds
[12:34:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:34:36] <_joe_> !log disabling puppet on mw1017, live-hacking it
[12:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:35:00] (PS1) Urbanecm: Add www.wpc.ncep.noaa.gov to wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/295478 (https://phabricator.wikimedia.org/T138383)
[12:35:13] PROBLEM - Apache HTTP on mw1290 is CRITICAL: Connection timed out
[12:35:26] !log starting reimage of mw1292
[12:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:35:55] oh my bad jenkins
[12:36:43] PROBLEM - puppet last run on mw1290 is CRITICAL: Timeout while attempting connection
[12:37:03] PROBLEM - salt-minion processes on mw1290 is CRITICAL: Timeout while attempting connection
[12:37:33] PROBLEM - configured eth on mw1287 is CRITICAL: Timeout while attempting connection
[12:37:33] PROBLEM - configured eth on mw1288 is CRITICAL: Timeout while attempting connection
[12:37:34] PROBLEM - Apache HTTP on mw1288 is CRITICAL: Connection timed out
[12:37:34] PROBLEM - Apache HTTP on mw1287 is CRITICAL: Connection timed out
[12:37:43] (CR) Alexandros Kosiaris: "Examples:" [puppet] - https://gerrit.wikimedia.org/r/295475 (owner: Alexandros Kosiaris)
[12:37:54] PROBLEM - dhclient process on mw1288 is CRITICAL: Timeout while attempting connection
[12:37:54] PROBLEM - dhclient process on mw1287 is CRITICAL: Timeout while attempting connection
[12:37:54] PROBLEM - Check size of conntrack table on mw1290 is CRITICAL: Timeout while attempting connection
[12:38:04] PROBLEM - mediawiki-installation DSH group on mw1288 is CRITICAL: Host mw1288 is not in mediawiki-installation dsh group
[12:38:04] PROBLEM - mediawiki-installation DSH group on mw1287 is CRITICAL: Host mw1287 is not in mediawiki-installation dsh group
[12:38:14] PROBLEM - DPKG on mw1290 is CRITICAL: Timeout while attempting connection
[12:38:33] PROBLEM - nutcracker port on mw1287 is CRITICAL: Timeout while attempting connection
[12:38:33] PROBLEM - nutcracker port on mw1288 is CRITICAL: Timeout while attempting connection
[12:38:33] PROBLEM - Disk space on mw1290 is CRITICAL: Timeout while attempting connection
[12:38:55] PROBLEM - MD RAID on mw1290 is CRITICAL: Timeout while attempting connection
[12:38:55] PROBLEM - nutcracker process on mw1287 is CRITICAL: Timeout while attempting connection
[12:38:55] PROBLEM - nutcracker process on mw1288 is CRITICAL: Timeout while attempting connection
[12:39:09] (CR) Alexandros Kosiaris: [C: 2] contint: create /var/lib/jenkins/builds [puppet] - https://gerrit.wikimedia.org/r/295477 (https://phabricator.wikimedia.org/T80385) (owner: Hashar)
[12:39:14] PROBLEM - puppet last run on mw1288 is CRITICAL: Timeout while attempting connection
[12:39:15] PROBLEM - puppet last run on mw1287 is CRITICAL: Timeout while attempting connection
[12:39:34] PROBLEM - salt-minion processes on mw1287 is CRITICAL: Timeout while attempting connection
[12:39:34] PROBLEM - salt-minion processes on mw1288 is CRITICAL: Timeout while attempting connection
[12:39:53] PROBLEM - configured eth on mw1290 is CRITICAL: Timeout while attempting connection
[12:40:05] PROBLEM - Check size of conntrack table on mw1288 is CRITICAL: Timeout while attempting connection
[12:40:05] PROBLEM - dhclient process on mw1290 is CRITICAL: Timeout while attempting connection
[12:40:05] PROBLEM - Check size of conntrack table on mw1287 is CRITICAL: Timeout while attempting connection
[12:40:14] PROBLEM - mediawiki-installation DSH group on mw1290 is CRITICAL: Host mw1290 is not in mediawiki-installation dsh group
[12:40:24] PROBLEM - DPKG on mw1288 is CRITICAL: Timeout while attempting connection
[12:40:24] PROBLEM - DPKG on mw1287 is CRITICAL: Timeout while attempting connection
[12:40:43] PROBLEM - Disk space on mw1287 is CRITICAL: Timeout while attempting connection
[12:40:44] PROBLEM - nutcracker port on mw1290 is CRITICAL: Timeout while attempting connection
[12:40:44] PROBLEM - Disk space on mw1288 is CRITICAL: Timeout while attempting connection
[12:41:03] PROBLEM - MD RAID on mw1288 is CRITICAL: Timeout while attempting connection
[12:41:03] PROBLEM - nutcracker process on mw1290 is CRITICAL: Timeout while attempting connection
[12:41:03] PROBLEM - MD RAID on mw1287 is CRITICAL: Timeout while attempting connection
[12:43:10] ouch too late
[12:43:16] what's all that?
[12:43:16] sorry this is me
[12:43:45] PROBLEM - jenkins_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war
[12:43:45] PROBLEM - jenkins_zmq_publisher on gallium is CRITICAL: Connection refused
[12:43:50] new appservers, I didn't pay attention to icinga for 10 mins
[12:43:52] and they appeared
[12:43:55] silencing
[12:43:56] sorry
[12:44:08] (PS3) Alexandros Kosiaris: ldaplist: Allow searching for more than one attribute [puppet] - https://gerrit.wikimedia.org/r/295475
[12:44:21] * gehel is hopefully not going to do the same in the next 5 minutes
[12:44:35] Hi, what's with Jenkins? See https://gerrit.wikimedia.org/r/#/c/295478/ ...
[12:44:47] (CR) Alexandros Kosiaris: "I'll remove the default substring matches. They cause more bugs than necessary. Perhaps adding them on a per-attribute basis makes more " [puppet] - https://gerrit.wikimedia.org/r/295475 (owner: Alexandros Kosiaris)
[12:45:10] Urbanecm: I think hashar is already on it...
[12:45:24] Urbanecm: I have shut it down to move a bunch of files
[12:45:28] will bring it back soonish
[12:45:50] gehel: the funny thing is that I paid attention to icinga until 10/15 minutes ago
[12:45:53] stepped out for a second
[12:45:53] Thx. Will it find my new patch and verify it?
[12:45:56] alarms
[12:46:45] Urbanecm: not sure, worst case add a comment with "recheck" and Jenkins should pick it up again
[12:47:02] Okay.
[12:49:04] !log T80385 Restarting Jenkins with builds dir set to "${JENKINS_HOME}/builds/${ITEM_FULL_NAME}" which is /var/lib/jenkins/builds/XXX
[12:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:50:32] INFO: npm-node-4.3 #17374 main build action completed: SUCCESS
[12:50:37] looks like some builds pass :)
[12:50:45] RECOVERY - jenkins_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war
[12:50:46] Urbanecm: Jenkins is back, it will catch up
[12:50:53] RECOVERY - jenkins_zmq_publisher on gallium is OK: TCP OK - 0.000 second response time on port 8888
[12:51:05] the events are held in Zuul; you can get an idea of the builds at https://integration.wikimedia.org/zuul/
[12:51:13] now that Jenkins is back jobs are running again
[12:51:37] Ok
[12:52:21] (CR) Steinsplitter: [C: 1] Add www.wpc.ncep.noaa.gov to wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/295478 (https://phabricator.wikimedia.org/T138383) (owner: Urbanecm)
[12:54:55] hashar: Argh, restarting Jenkins again?
[12:55:04] yeah stopping it again sorry
[12:55:25] * James_F is just eager to see the latest sync of code to Beta Cluster.
[12:55:27] :-)
[12:57:14] PROBLEM - puppet last run on db1055 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:58:54] (PS2) Rush: labstore: Remove redundant calls to lower() for user names [puppet] - https://gerrit.wikimedia.org/r/295455 (owner: Tim Landscheidt)
[13:02:16] almost done
[13:02:24] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[13:02:53] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[13:03:16] !log Manually moved some missing build records. Restarting Jenkins
[13:03:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:05:36] (CR) Rush: "recheck" [puppet] - https://gerrit.wikimedia.org/r/295455 (owner: Tim Landscheidt)
[13:06:05] RECOVERY - Apache HTTP on mw1287 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.002 second response time
[13:07:19] (CR) Rush: [C: 2] "thanks Tim, seems fine" [puppet] - https://gerrit.wikimedia.org/r/295455 (owner: Tim Landscheidt)
[13:07:41] James_F: Jenkins should be all back
[13:07:54] hashar: I caught something of yours on merge
[13:07:55] Antoine Musso: contint: create /var/lib/jenkins/builds
[13:07:58] is this ok?
[13:08:05] hashar: Thank you!
[13:08:09] chasemp: yes
[13:08:28] chasemp: I have manually created it on the host (gallium) Alexandros merged that change
[13:08:55] cool no biggie just convention to double chek
[13:08:57] the puppet change should just be about creating a directory
[13:08:58] check
[13:09:05] yeah better to double check :-}
[13:09:24] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[13:09:54] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[13:10:33] RECOVERY - salt-minion processes on mw1287 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[13:10:44] RECOVERY - configured eth on mw1287 is OK: OK - interfaces up
[13:10:59] (CR) Hashar: [C: 1] "I have updated the Jenkins configuration to save the build record under /var/lib/jenkins/builds and have migrated all existing reco" [puppet] - https://gerrit.wikimedia.org/r/295255 (https://phabricator.wikimedia.org/T80385) (owner: Hashar)
[13:11:04] RECOVERY - Check size of conntrack table on mw1287 is OK: OK: nf_conntrack is 0 % full
[13:11:14] RECOVERY - dhclient process on mw1287 is OK: PROCS OK: 0 processes with command name dhclient
[13:11:35] RECOVERY - Disk space on mw1287 is OK: DISK OK
[13:11:44] RECOVERY - nutcracker port on mw1287 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[13:11:51] and Icinga checks for gallium are all green
[13:12:04] RECOVERY - MD RAID on mw1287 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[13:12:15] RECOVERY - nutcracker process on mw1287 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[13:13:18] (PS2) Hashar: Enable backup for gallium [puppet] - https://gerrit.wikimedia.org/r/293690 (https://phabricator.wikimedia.org/T80385) (owner: Muehlenhoff)
[13:14:10] (CR) Hashar: "I have rebased this change on top of https://gerrit.wikimedia.org/r/#/c/295255/ which adds an exclude rule to prevent backing up the Jenkin" [puppet] - https://gerrit.wikimedia.org/r/293690 (https://phabricator.wikimedia.org/T80385) (owner: Muehlenhoff)
[13:15:44] RECOVERY - Apache HTTP on mw1288 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.027 second response time
[13:16:09] akosiaris: moritzm: I got the Jenkins build history migrated \O/ Would need to add the exclude rule in bacula director then enable the backup system whenever one can monitor its actions
[13:16:14] RECOVERY - DPKG on mw1287 is OK: All packages OK
[13:17:29] (PS1) Gehel: Configuring new elastic1038-1042 servers [puppet] - https://gerrit.wikimedia.org/r/295490 (https://phabricator.wikimedia.org/T138329)
[13:17:43] (CR) Muehlenhoff: "We still have the discrepancy between the retention in Jenkins (15/30 days) compared to the backup retention period (60 days). @Alex, is th" [puppet] - https://gerrit.wikimedia.org/r/295255 (https://phabricator.wikimedia.org/T80385) (owner: Hashar)
[13:17:54] RECOVERY - Apache HTTP on mw1290 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.011 second response time
[13:19:24] RECOVERY - nutcracker process on mw1288 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[13:19:54] RECOVERY - salt-minion processes on mw1288 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[13:20:14] RECOVERY - configured eth on mw1288 is OK: OK - interfaces up
[13:20:34] RECOVERY - Check size of conntrack table on mw1288 is OK: OK: nf_conntrack is 0 % full
[13:20:44] RECOVERY - dhclient process on mw1288 is OK: PROCS OK: 0 processes with command name dhclient
[13:20:46] (CR) Hashar: "The 15/30 days Jenkins retentions are for the build records / artifacts etc that are in /var/lib/jenkins/builds . Given this change excl" [puppet] - https://gerrit.wikimedia.org/r/295255 (https://phabricator.wikimedia.org/T80385) (owner: Hashar)
[13:21:05] RECOVERY - Disk space on mw1288 is OK: DISK OK
[13:21:14] RECOVERY - puppet last run on db1055 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[13:21:15] RECOVERY - nutcracker port on mw1288 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[13:21:25] RECOVERY - MD RAID on mw1288 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[13:22:04] RECOVERY - salt-minion processes on mw1290 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[13:22:44] RECOVERY - configured eth on mw1290 is OK: OK - interfaces up
[13:22:54] RECOVERY - dhclient process on mw1290 is OK: PROCS OK: 0 processes with command name dhclient
[13:23:03] RECOVERY - Check size of conntrack table on mw1290 is OK: OK: nf_conntrack is 0 % full
[13:23:24] RECOVERY - nutcracker port on mw1290 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[13:23:34] RECOVERY - Disk space on mw1290 is OK: DISK OK
[13:23:44] RECOVERY - nutcracker process on mw1290 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[13:24:04] RECOVERY - MD RAID on mw1290 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[13:24:10] these are the appservers after the first puppet run, sorry again for the spam
[13:25:34] RECOVERY - DPKG on mw1288 is OK: All packages OK
[13:25:34] RECOVERY - DPKG on mw1290 is OK: All packages OK
[13:29:21] (PS1) Filippo Giunchedi: swift: align partition to 1M boundary [puppet] - https://gerrit.wikimedia.org/r/295492
[13:31:34] !log configuring new elasticsearch servers elastic1038-1042 in eqiad
[13:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:31:48] (CR) Gehel: [C: 2] Configuring new elastic1038-1042 servers [puppet] - https://gerrit.wikimedia.org/r/295490 (https://phabricator.wikimedia.org/T138329) (owner: Gehel)
[13:34:53] RECOVERY - puppet last run on mw1287 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[13:35:39] Operations, ops-codfw, media-storage, Patch-For-Review: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T136630#2399018 (fgiunchedi) @papaul the two ssd were in raid1, was it the default configuration? I'm asking because in this case we need all disks in raid0, this is what I...
[13:41:29] (PS1) Giuseppe Lavagetto: role::cache::text: handle url shortener requests [puppet] - https://gerrit.wikimedia.org/r/295493 (https://phabricator.wikimedia.org/T133485)
[13:42:57] !log add 500G to fluorine /a (almost full)
[13:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:45:23] Operations, Wikimedia-SVG-rendering, Upstream: SVG rendering with marker-element is different between librsvg and Inkscape - https://phabricator.wikimedia.org/T97758#2399052 (MoritzMuehlenhoff)
[13:45:35] Operations, Wikimedia-SVG-rendering, Upstream: SVG rendering with marker-element is different between librsvg and Inkscape - https://phabricator.wikimedia.org/T97758#1251624 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[13:46:23] godog: we would want to one day revisit what we collect on fluorine :D api.log is already 45GBytes large ..
[13:50:25] Operations, Wikimedia-SVG-rendering: Install Amiri font (arabic) for svg - https://phabricator.wikimedia.org/T135347#2399101 (MoritzMuehlenhoff)
[13:51:14] RECOVERY - puppet last run on mw1288 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[13:51:31] Operations, Wikimedia-SVG-rendering: Install Amiri font (arabic) for svg - https://phabricator.wikimedia.org/T135347#2295971 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[13:51:37] hashar: heh indeed, there's already 90d retention but it keeps slowly growing too
[13:53:45] godog: ah yeah Jenkins has its own 90-day history of all configuration changes
[13:54:28] so in theory if we look at a 60-day-old backup from bacula we could get the config from 150 days ago..
[13:56:26] (PS1) Elukey: Add new MediaWiki appservers to the scap DSH list. [puppet] - https://gerrit.wikimedia.org/r/295497
[13:56:38] (PS1) Muehlenhoff: Add Amiri font to the scalers [puppet] - https://gerrit.wikimedia.org/r/295498 (https://phabricator.wikimedia.org/T135347)
[13:57:15] (PS2) Giuseppe Lavagetto: role::cache::text: handle url shortener requests [puppet] - https://gerrit.wikimedia.org/r/295493 (https://phabricator.wikimedia.org/T133485)
[13:57:53] (CR) Alexandros Kosiaris: [C: 2] "We can safely ignore the retention discrepancy. It's a common pattern." [puppet] - https://gerrit.wikimedia.org/r/295255 (https://phabricator.wikimedia.org/T80385) (owner: Hashar)
[13:58:00] (PS2) Alexandros Kosiaris: contint: do not backup Jenkins build history [puppet] - https://gerrit.wikimedia.org/r/295255 (https://phabricator.wikimedia.org/T80385) (owner: Hashar)
[13:58:03] (CR) Eevans: [C: 1] "My Puppet-fu is weak, but this LGTM." [puppet] - https://gerrit.wikimedia.org/r/295123 (https://phabricator.wikimedia.org/T137422) (owner: Nicko)
[13:58:14] (CR) Alexandros Kosiaris: [V: 2] contint: do not backup Jenkins build history [puppet] - https://gerrit.wikimedia.org/r/295255 (https://phabricator.wikimedia.org/T80385) (owner: Hashar)
[13:59:11] (PS3) Alexandros Kosiaris: Enable backup for gallium [puppet] - https://gerrit.wikimedia.org/r/293690 (https://phabricator.wikimedia.org/T80385) (owner: Muehlenhoff)
[13:59:18] (CR) Alexandros Kosiaris: [C: 2 V: 2] Enable backup for gallium [puppet] - https://gerrit.wikimedia.org/r/293690 (https://phabricator.wikimedia.org/T80385) (owner: Muehlenhoff)
[13:59:56] (CR) Gehel: "lgtm" [puppet] - https://gerrit.wikimedia.org/r/295497 (owner: Elukey)
[14:00:05] (CR) Gehel: [C: 1] Add new MediaWiki appservers to the scap DSH list. [puppet] - https://gerrit.wikimedia.org/r/295497 (owner: Elukey)
[14:00:21] (PS2) Elukey: Add new MediaWiki appservers to the scap DSH list. [puppet] - https://gerrit.wikimedia.org/r/295497
[14:02:43] (CR) BBlack: role::cache::text: handle url shortener requests (1 comment) [puppet] - https://gerrit.wikimedia.org/r/295493 (https://phabricator.wikimedia.org/T133485) (owner: Giuseppe Lavagetto)
[14:03:11] (CR) Elukey: [C: 2 V: 2] Add new MediaWiki appservers to the scap DSH list. [puppet] - https://gerrit.wikimedia.org/r/295497 (owner: Elukey)
[14:04:13] !log rolling restart of hhvm/apache on app servers in eqiad for expat security update
[14:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:05:25] (PS1) Filippo Giunchedi: install_server: add partman recipe for prometheus [puppet] - https://gerrit.wikimedia.org/r/295499
[14:08:20] (PS2) Filippo Giunchedi: install_server: add partman recipe for prometheus [puppet] - https://gerrit.wikimedia.org/r/295499
[14:08:27] (CR) Filippo Giunchedi: [C: 2 V: 2] install_server: add partman recipe for prometheus [puppet] - https://gerrit.wikimedia.org/r/295499 (owner: Filippo Giunchedi)
[14:10:25] Operations, Wikimedia-SVG-rendering, Patch-For-Review: Install Amiri font (arabic) for svg - https://phabricator.wikimedia.org/T135347#2399192 (MoritzMuehlenhoff) @Uwe_a: I have prepared a patch to install that font on the Wikimedia servers. Do you have a test case SVG which would visually improve if...
[14:13:16] (PS3) Legoktm: role::cache::text: handle url shortener requests [puppet] - https://gerrit.wikimedia.org/r/295493 (https://phabricator.wikimedia.org/T133485) (owner: Giuseppe Lavagetto)
[14:13:19] (CR) Legoktm: role::cache::text: handle url shortener requests (1 comment) [puppet] - https://gerrit.wikimedia.org/r/295493 (https://phabricator.wikimedia.org/T133485) (owner: Giuseppe Lavagetto)
[14:20:02] (CR) Alexandros Kosiaris: "2000 OK estimate files=800,331 bytes=31,458,895,274" [puppet] - https://gerrit.wikimedia.org/r/293690 (https://phabricator.wikimedia.org/T80385) (owner: Muehlenhoff)
[14:27:20] (CR) BBlack: [C: 1] role::cache::text: handle url shortener requests [puppet] - https://gerrit.wikimedia.org/r/295493 (https://phabricator.wikimedia.org/T133485) (owner: Giuseppe Lavagetto)
[14:28:13] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0
[14:29:33] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[14:29:44] !log running https://phabricator.wikimedia.org/diffusion/ECAU/browse/master/maintenance/checkLocalUser.php for some users T119736
[14:29:45] T119736: Could not find local user data for {Username}@{wiki} - https://phabricator.wikimedia.org/T119736
[14:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:30:24] Operations, ops-codfw, media-storage, Patch-For-Review: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T136630#2399328 (Papaul) @fgiunchedi yes, the default was raid1; I can put that in raid0 like the other disks
[14:32:32] !log checksumming m1 databases in preparation for failover
[14:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:40:43] RECOVERY - mediawiki-installation DSH group on mw1288 is OK: OK
[14:40:43] RECOVERY - mediawiki-installation DSH group on mw1287 is OK: OK
[14:43:04] RECOVERY - mediawiki-installation DSH group on mw1290 is OK: OK
[14:49:37] (PS5) EBernhardson: Duplicate logstash output to alternate elasticsearch cluster [puppet] - https://gerrit.wikimedia.org/r/295442
[15:00:04] anomie, ostriches, thcipriani, marktraceur, and Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160622T1500). Please do the needful.
[15:00:04] dapatrick and Urbanecm: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[15:00:28] Around
[15:00:32] Operations, Gerrit, Release-Engineering-Team, WMF-Legal, and 2 others: Gerrit seemingly violates data retention guidelines - https://phabricator.wikimedia.org/T114395#1694145 (Mpaulson) Has this been adjusted so that it deletes the logs after 30 days?
[15:01:14] I can SWAT today.
[15:01:27] !log rebooting bohrium.eqiad.wmnet (running piwik) for kernel upgrades
[15:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:02:03] dapatrick: is there a backport for https://gerrit.wikimedia.org/r/#/c/295191/ ?
[15:02:45] thcipriani: Shoot. No, I didn't do that.
[15:03:04] I can do it, is it just for wmf.6?
[15:03:31] I believe so, yes. What versions are currently in use?
[15:04:23] wmf.7 just made it to testwiki, not sure if this made it in before the cut https://noc.wikimedia.org/conf/
[15:05:50] hmm, doesn't look like it made it in
[15:06:10] Nope. It was merged after the branch was cut.
[15:06:27] ack. kk, so backport for wmf.6 and wmf.7?
[15:07:54] Yes, please. Thanks! Sorry I didn't cherry pick those myself.
[15:07:55] dapatrick: could you check me on these, please: https://gerrit.wikimedia.org/r/#/c/295510/ https://gerrit.wikimedia.org/r/#/c/295511/
[15:08:00] np :)
[15:08:28] (PS2) Thcipriani: Add www.wpc.ncep.noaa.gov to wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/295478 (https://phabricator.wikimedia.org/T138383) (owner: Urbanecm)
[15:09:30] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/295478 (https://phabricator.wikimedia.org/T138383) (owner: Urbanecm)
[15:09:55] thcipriani: Those look good.
[15:10:06] dapatrick: cool, thanks
[15:10:09] (Merged) jenkins-bot: Add www.wpc.ncep.noaa.gov to wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/295478 (https://phabricator.wikimedia.org/T138383) (owner: Urbanecm)
[15:13:19] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:295478|Add www.wpc.ncep.noaa.gov to wgCopyUploadsDomains]] (duration: 00m 54s)
[15:13:34] ^ Urbanecm check please
[15:13:35] Operations, ops-eqiad: eqiad: Install SSD's into ganeti hosts - https://phabricator.wikimedia.org/T138414#2399490 (Cmjohnson)
[15:14:12] I have no access to any tool which uses this whitelist so I have to ask the author of the request on phab.
[15:14:51] Urbanecm: okie doke, well, it's live now :)
[15:14:59] (CR) Dereckson: Add www.wpc.ncep.noaa.gov to wgCopyUploadsDomains (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/295478 (https://phabricator.wikimedia.org/T138383) (owner: Urbanecm)
[15:15:18] (PS1) Filippo Giunchedi: install_server: smaller root for single-disk /srv [puppet] - https://gerrit.wikimedia.org/r/295513
[15:17:01] (PS1) Dereckson: Improve style [mediawiki-config] - https://gerrit.wikimedia.org/r/295514
[15:17:22] (CR) Dereckson: "Follow-up: I83bb5af83df49c4243f6bd68002c62b76afc0226" [mediawiki-config] - https://gerrit.wikimedia.org/r/295478 (https://phabricator.wikimedia.org/T138383) (owner: Urbanecm)
[15:17:55] Hi, may I suggest to merge https://gerrit.wikimedia.org/r/#/c/295514/ too, to fix the comma issue for 295478?
[15:18:00] !log thcipriani@tin Synchronized php-1.28.0-wmf.7/extensions/OATHAuth: SWAT: [[gerrit:295511|Fixup qrcode-generating js, to stop race condition.]] (duration: 00m 27s)
[15:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:18:08] ^ dapatrick check please
[15:18:22] Checking.
[15:19:03] Urbanecm: the goal of this trailing comma is that the next diff only touches the line you add, not the line before, so git blame is more accurate
[15:19:26] Dereckson: sure, thank you :)
[15:20:26] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/295514 (owner: Dereckson)
[15:20:28] You're welcome.
[15:20:56] Dereckson: Thx for the patch. I just noticed your comment in the task on phab, so I'm going to create a patch which whitelists *.noaa.gov instead of only one domain.
[15:21:11] (Merged) jenkins-bot: Improve style [mediawiki-config] - https://gerrit.wikimedia.org/r/295514 (owner: Dereckson)
[15:21:28] Urbanecm: I noticed * only replaces one subdomain
[15:21:48] thcipriani: Weird. This looks like old code on enwiki.
[15:21:49] so *.noaa.gov would be for quux.noaa.gov but not for www.quux.noaa.gov nor www.alpha.beta.noaa.gov :(
[15:22:12] And do you know how to whitelist all subdomains of noaa.gov?
[15:22:27] dapatrick: ah, enwiki is on wmf.6, I just sync'd wmf.7 so far
[15:22:33] With 3 entries: *.noaa.gov *.*.noaa.gov *.*.*.noaa.gov
[15:22:39] But I'm not sure it's a really good idea.
[15:22:49] If we do that, that means we trust EVERY one of their servers.
[15:23:10] This whitelist is restricted to avoid DDoS from the Wikimedia cluster to their network, or from their network to our cluster
[15:23:12] dapatrick: possible to test on any group0 wikis, or should I roll forward with wmf.7?
[15:23:13] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:295514|Improve style]] (duration: 00m 33s)
[15:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:23:21] ^ Dereckson sync'd
[15:23:26] Operations, Traffic, Patch-For-Review: Investigate TCP Fast Open for tlsproxy - https://phabricator.wikimedia.org/T108827#2399537 (ema) A few more things tools-wise. TFO support has been added to curl in version 7.49.0: [[ https://curl.haxx.se/changes.html#7_49_0 | --tcp-fastopen]]. Unfortunately De...
[15:23:28] ack'd
[15:23:37] I think that at noaa.gov there should be only content created by them, so it should all be in the public domain.
[15:23:40] dapatrick: er...roll forward with wmf.6
[15:23:41] We don't know how NOAA organizes their servers (one datacenter? a lot?)
[15:24:06] So should we whitelist only the domain which was mentioned in the request?
[15:24:07] Yes, but it's an operations and security concern here, as you would increase our attack surface
[15:24:11] thcipriani: Ah, yes, go ahead with wmf.6.
[15:24:17] dapatrick: ack.
[15:24:34] And if we add only *.*.noaa.gov? All mentioned domains will match this filter I think.
[15:24:37] I confirm NOAA content is mostly PD-Gov, so there isn't any concern for the licensing
[15:25:18] but imagine they have something.noaa.gov in another network from their own, you would also whitelist this one.
[15:25:39] I'd suggest here to be conservative and only whitelist the needed domains;
[15:25:56] you could reach csteipp for a second opinion if you find information about how NOAA manages their network sources.
[15:25:58] And should I add the domain that you've mentioned in the request?
[15:26:27] Perhaps, but let's ask Fae first. [15:26:45] In the task? [15:26:45] !log thcipriani@tin Synchronized php-1.28.0-wmf.6/extensions/OATHAuth: SWAT: [[gerrit:295510|Fixup qrcode-generating js, to stop race condition.]] (duration: 00m 33s) [15:26:48] (03PS1) 10Gehel: Configuring new elastic1043-1047 servers [puppet] - 10https://gerrit.wikimedia.org/r/295524 (https://phabricator.wikimedia.org/T138329) [15:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:26:49] ^ dapatrick check please [15:28:46] Urbanecm: I pinged them on IRC, but not sure they are online. [15:29:15] (03PS1) 10Gehel: Add new MediaWiki appserver to the scap DSH list. [puppet] - 10https://gerrit.wikimedia.org/r/295525 [15:29:17] Nobody with this nick is in this channel. [15:29:29] #wikimedia-commons [15:30:33] By the way, I tested an upload by URL; it works for www.wpc.ncep.noaa.gov. [15:31:14] Thx [15:31:19] (03CR) 10Elukey: [C: 031] Add new MediaWiki appserver to the scap DSH list. [puppet] - 10https://gerrit.wikimedia.org/r/295525 (owner: 10Gehel) [15:31:35] I would like to thank you for taking care of these requests so quickly. That's appreciated. [15:32:27] thcipriani: Working as expected. Thanks! [15:32:40] dapatrick: awesome, thanks for checking! [15:33:02] (03CR) 10Gehel: [C: 032] Add new MediaWiki appserver to the scap DSH list. [puppet] - 10https://gerrit.wikimedia.org/r/295525 (owner: 10Gehel) [15:39:20] PROBLEM - swift-object-auditor on ms-be2022 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [15:39:31] PROBLEM - swift-object-replicator on ms-be2022 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [15:39:40] PROBLEM - swift-account-reaper on ms-be2022 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [15:39:41] PROBLEM - swift-object-server on ms-be2022 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [15:40:00] PROBLEM - swift-account-auditor on ms-be2022 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [15:40:11] PROBLEM - swift-container-auditor on ms-be2022 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [15:40:11] PROBLEM - swift-account-server on ms-be2022 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [15:40:11] PROBLEM - swift-object-updater on ms-be2022 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [15:40:41] PROBLEM - swift-container-server on ms-be2022 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [15:40:41] PROBLEM - swift-account-replicator on ms-be2022 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [15:40:58] err, that's me ^ apologies [15:41:11] PROBLEM - swift-container-updater on ms-be2022 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [15:41:21] PROBLEM - swift-container-replicator on ms-be2022 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [15:41:31] RECOVERY - swift-object-auditor on ms-be2022 is OK: PROCS OK: 1 process with regex args
^/usr/bin/python /usr/bin/swift-object-auditor [15:41:50] RECOVERY - swift-object-replicator on ms-be2022 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [15:41:52] RECOVERY - swift-account-reaper on ms-be2022 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [15:42:01] RECOVERY - swift-object-server on ms-be2022 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [15:42:21] RECOVERY - swift-account-auditor on ms-be2022 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [15:42:31] RECOVERY - swift-container-auditor on ms-be2022 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [15:42:31] RECOVERY - swift-account-server on ms-be2022 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [15:42:31] RECOVERY - swift-object-updater on ms-be2022 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [15:43:01] RECOVERY - swift-container-server on ms-be2022 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [15:43:01] RECOVERY - swift-account-replicator on ms-be2022 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [15:43:31] RECOVERY - swift-container-updater on ms-be2022 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [15:43:41] RECOVERY - swift-container-replicator on ms-be2022 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [15:48:35] 06Operations, 10ops-eqiad, 06DC-Ops: dbstore1001 management interface has saturated the number of available ssh connections - https://phabricator.wikimedia.org/T126227#2399596 (10Cmjohnson) 05Open>03Resolved Fixed [15:49:10] 06Operations, 10ops-eqiad: db1061 and db1062 management interface needs physical reset - https://phabricator.wikimedia.org/T138368#2399599 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson Fixed [15:49:21] (03PS1) 10Filippo Giunchedi: install_server: separate /srv for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/295532 [15:50:51] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] install_server: separate /srv for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/295532 (owner: 10Filippo Giunchedi) [15:51:26] 06Operations, 10ops-eqiad: eqiad: Install ssds to labmon1001 - https://phabricator.wikimedia.org/T138415#2399605 (10Krenair) [15:51:42] 06Operations, 10ops-eqiad: eqiad: Install ssds to labmon1001 - https://phabricator.wikimedia.org/T138415#2399608 (10Krenair) [15:54:47] 06Operations, 10ops-eqiad: rack/setup/install/deploy labsdb1009-labsdb1011 - https://phabricator.wikimedia.org/T136860#2399627 (10Cmjohnson) I cannot rack this in A5, it is a 10G rack. [15:56:13] 06Operations, 10LDAP-Access-Requests: LDAP Account required for Transparency Report - https://phabricator.wikimedia.org/T138369#2398350 (10Krenair) You don't appear on https://wikimediafoundation.org/wiki/Staff_and_contractors nor a couple of other pages I checked... Do you have a " (WMF)" SUL account or somet... 
[15:58:02] 06Operations, 10ops-eqiad: mw1063 broken - https://phabricator.wikimedia.org/T137381#2399632 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson mw1063 has been decommissioned per T129060 [15:58:54] (03PS2) 10Gehel: Configuring new elastic1043-1047 servers [puppet] - 10https://gerrit.wikimedia.org/r/295524 (https://phabricator.wikimedia.org/T138329) [16:00:42] (03CR) 10Gehel: [C: 032] Configuring new elastic1043-1047 servers [puppet] - 10https://gerrit.wikimedia.org/r/295524 (https://phabricator.wikimedia.org/T138329) (owner: 10Gehel) [16:05:07] (03PS1) 10Gehel: Adding rack location of new elasticsearch servers [puppet] - 10https://gerrit.wikimedia.org/r/295536 (https://phabricator.wikimedia.org/T138329) [16:06:38] 06Operations, 10ops-eqiad, 10Analytics-Cluster, 06Analytics-Kanban: analytics1049.eqiad.wmnet disk failure - https://phabricator.wikimedia.org/T137273#2399672 (10Cmjohnson) Disk has been ordered. [16:06:44] (03CR) 10Gehel: [C: 032] Adding rack location of new elasticsearch servers [puppet] - 10https://gerrit.wikimedia.org/r/295536 (https://phabricator.wikimedia.org/T138329) (owner: 10Gehel) [16:09:23] (03PS1) 10Gehel: Fixed missing location of elastic1045 [puppet] - 10https://gerrit.wikimedia.org/r/295537 (https://phabricator.wikimedia.org/T138329) [16:10:06] (03PS1) 10BBlack: r::c::perf: consolidate net tuning and mysterious values [puppet] - 10https://gerrit.wikimedia.org/r/295538 [16:10:08] (03PS1) 10BBlack: r::c::perf: un-mysterious netdev_max_backlog [puppet] - 10https://gerrit.wikimedia.org/r/295539 [16:10:10] (03PS1) 10BBlack: r::c::perf: un-mysterious somaxconn + syn_backlog [puppet] - 10https://gerrit.wikimedia.org/r/295540 [16:10:12] (03PS1) 10BBlack: r::c::perf: un-mysterious the rest [puppet] - 10https://gerrit.wikimedia.org/r/295541 [16:10:14] (03PS1) 10BBlack: r::c::perf: disable prequeue timestamps [puppet] - 10https://gerrit.wikimedia.org/r/295542 [16:10:16] (03PS1) 10BBlack: r::c::perf: raise netdev_budget a bit [puppet] - 10https://gerrit.wikimedia.org/r/295543 [16:10:18] (03PS1) 10BBlack: LVS: sysctl netdev tuning [puppet] - 10https://gerrit.wikimedia.org/r/295544 [16:10:40] (03CR) 10Gehel: [C: 032] Fixed missing location of elastic1045 [puppet] - 10https://gerrit.wikimedia.org/r/295537 (https://phabricator.wikimedia.org/T138329) (owner: 10Gehel) [16:13:29] 06Operations, 10Traffic, 13Patch-For-Review: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#1970940 (10Krinkle) How does the cache TTL of Varnish interact with the concept of 304 renewals? I remember in the past we often had bugs where a cache object had expired (but not ye... [16:13:34] ori_: bblack: ^ [16:16:50] (03CR) 10jenkins-bot: [V: 04-1] LVS: sysctl netdev tuning [puppet] - 10https://gerrit.wikimedia.org/r/295544 (owner: 10BBlack) [16:16:58] 06Operations, 10ops-eqiad: db1009 degraded RAID (failed disk) - https://phabricator.wikimedia.org/T138203#2393080 (10Cmjohnson) Disk swapped, rebuilding root@db1009:~# megacli -PDList -aALL |grep "Firmware state:" Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up F... [16:17:27] 06Operations, 10LDAP-Access-Requests: LDAP Account required for Transparency Report - https://phabricator.wikimedia.org/T138369#2399740 (10siddharth11) I've just recently joined WMF, that too for a couple of months. My main work is to collaborate with the Legal team comprising Aeryn Palmer and James Buatti to...
[16:18:13] (03CR) 10Krinkle: Only mirror refs/heads/ and refs/tags/ for mw core and operations/puppet (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/295011 (owner: 10Paladox) [16:19:43] (03CR) 10Paladox: Only mirror refs/heads/ and refs/tags/ for mw core and operations/puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/295011 (owner: 10Paladox) [16:19:50] Krinkle ^^ [16:20:30] !log new elasticsearch servers elastic1032-1047 are configured and have joined the eqiad cluster [16:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:20:48] paladox: What did you test? Did you test having a repository in Phab and on GitHub and seeing it replicate correctly? [16:20:57] Krinkle yes [16:21:15] How does it know to connect MW with github/wikimedia/mediawiki? [16:21:16] I tested on my local install, using an imported repo from Gerrit, which was mw core. [16:21:37] Krinkle since mw core is big it won't push. [16:21:51] I asked GitHub and they said they do have a memory limit, or some such limit. [16:22:18] mw core now includes refs/changes/, same for operations/puppet, and they have so many refs they won't push. [16:22:22] How does Phabricator convert the repo name "MW (mediawiki)" into the URL https://github.com/wikimedia/mediawiki and know to use that configuration? [16:22:42] mw-core isn't replicated from Phabricator right now, it's from Gerrit, right? [16:22:46] Krinkle since in the mirror URI it does git push --mirror [16:22:50] Which, with this patch, will only push branches and tags. [16:22:57] and yes it is being replicated from phabricator [16:23:17] But it doesn't work due to refs/changes/, which has so many refs [16:23:42] Krinkle https://phabricator.wikimedia.org/diffusion/MW/uri/view/4/ [16:24:05] (03PS1) 10Yurik: Maps: Limit query exec time for kartotherian user [puppet] - 10https://gerrit.wikimedia.org/r/295548 (https://phabricator.wikimedia.org/T138422) [16:24:06] it does git push --mirror https://github.com/wikimedia/mediawiki
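For context on the exchange above: `git push --mirror` pushes every ref under refs/ to the remote, so Gerrit's per-review refs/changes/ namespace comes along too, which is what makes the push of mw core and operations/puppet too large; change 295011 aims to mirror only refs/heads/ and refs/tags/. Below is a minimal PHP sketch of that filtering idea, purely illustrative: the ref names are examples and this is not Phabricator's or Gerrit's actual implementation.

```php
<?php
// Hypothetical illustration of the ref filtering the patch above aims for.
// `git push --mirror` would push every one of these refs; a
// branches-and-tags-only mirror keeps just refs/heads/ and refs/tags/.
$refs = [
	'refs/heads/master',
	'refs/tags/1.28.0-wmf.7',
	'refs/changes/11/295011/1', // Gerrit review refs: huge in number
];
$mirrored = array_filter( $refs, function ( $ref ) {
	return (bool)preg_match( '!^refs/(heads|tags)/!', $ref );
} );
print_r( $mirrored ); // only the branch head and the tag survive
```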