[00:03:10] (03CR) 10jenkins-bot: Make babel use Database and SUL wikis use metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368429 (https://phabricator.wikimedia.org/T145366) (owner: 10Reedy) [00:33:14] 10Operations, 10Services (done), 10User-mobrovac: nodejs 6.11 - https://phabricator.wikimedia.org/T170548#3491715 (10Nuria) [00:34:03] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#3491724 (10Nuria) 05Open>03Resolved [00:34:13] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban: rack/setup/install replacement stat1006 (stat1003 replacement) - https://phabricator.wikimedia.org/T165366#3491727 (10Nuria) 05Open>03Resolved [01:58:31] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 45 [02:27:11] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.11) (duration: 07m 35s) [02:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:52:30] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.12) (duration: 05m 54s) [02:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:59:33] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Aug 2 02:59:33 UTC 2017 (duration 7m 3s) [02:59:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:27:30] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 693.00 seconds [04:08:40] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 272.97 seconds [04:33:10] PROBLEM - cassandra-b service on restbase-dev1004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [04:33:10] PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[04:33:11] PROBLEM - cassandra-b CQL 10.64.0.168:9042 on restbase-dev1004 is CRITICAL: connect to address 10.64.0.168 and port 9042: Connection refused [04:33:31] PROBLEM - cassandra-b SSL 10.64.0.168:7001 on restbase-dev1004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [04:54:11] RECOVERY - cassandra-b service on restbase-dev1004 is OK: OK - cassandra-b is active [04:54:20] RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational [04:55:40] RECOVERY - cassandra-b SSL 10.64.0.168:7001 on restbase-dev1004 is OK: SSL OK - Certificate restbase-dev1004-b valid until 2018-07-20 15:08:05 +0000 (expires in 352 days) [04:56:20] RECOVERY - cassandra-b CQL 10.64.0.168:9042 on restbase-dev1004 is OK: TCP OK - 0.000 second response time on 10.64.0.168 port 9042 [04:57:02] (03PS1) 10EBernhardson: CirrusSearch configuration for LTR AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369594 (https://phabricator.wikimedia.org/T171212) [04:58:00] PROBLEM - SSH cp3031.mgmt on cp3031.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:58:23] (03CR) 10jerkins-bot: [V: 04-1] CirrusSearch configuration for LTR AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369594 (https://phabricator.wikimedia.org/T171212) (owner: 10EBernhardson) [05:06:01] (03PS2) 10EBernhardson: CirrusSearch configuration for LTR AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369594 (https://phabricator.wikimedia.org/T171212) [05:06:59] (03CR) 10jerkins-bot: [V: 04-1] CirrusSearch configuration for LTR AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369594 (https://phabricator.wikimedia.org/T171212) (owner: 10EBernhardson) [05:10:58] (03PS3) 10EBernhardson: CirrusSearch configuration for LTR AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369594 (https://phabricator.wikimedia.org/T171212) [05:44:31] (03PS1) 10Chad: Add newlines to all dblists missing them 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/369595 [05:44:33] (03CR) 10Chad: [C: 032] Add newlines to all dblists missing them [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369595 (owner: 10Chad) [05:46:14] (03Merged) 10jenkins-bot: Add newlines to all dblists missing them [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369595 (owner: 10Chad) [05:46:24] (03CR) 10jenkins-bot: Add newlines to all dblists missing them [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369595 (owner: 10Chad) [05:47:24] !log demon@tin Synchronized dblists/: No-op (duration: 00m 47s) [05:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:20] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:22:11] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:22:20] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:22:21] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:22:21] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:22:21] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:22:21] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:23:10] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:23:11] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [06:23:11] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [06:23:11] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [06:23:11] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave [06:23:11] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [06:32:30] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:41:41] Backups ^ [06:45:27] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1055" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369598 [06:45:31] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1055" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369598 [06:48:28] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1055" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369598 (owner: 10Marostegui) [06:49:57] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1055" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369598 (owner: 10Marostegui) [06:50:09] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1055" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369598 (owner: 10Marostegui) [06:50:58] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1055 - T166204 (duration: 00m 45s) [06:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:11] T166204: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204 [06:57:09] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2057" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/369599 [06:57:12] (03PS2) 10Marostegui: Revert "db-codfw.php: Depool db2057" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369599 [06:57:51] RECOVERY - SSH cp3031.mgmt on cp3031.mgmt is OK: SSH OK - OpenSSH_5.8 (protocol 2.0) [06:58:40] (03PS1) 10Marostegui: db2057.yaml: Update db2057 socket location [puppet] - 10https://gerrit.wikimedia.org/r/369600 (https://phabricator.wikimedia.org/T148507) [06:58:43] !log Stop MySQL on db2057 for maintenance - T148507 [06:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:40] PROBLEM - pdfrender on scb1001 is CRITICAL: connect to address 10.64.0.16 and port 5252: Connection refused [07:03:30] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/compiler02/7255/" [puppet] - 10https://gerrit.wikimedia.org/r/369600 (https://phabricator.wikimedia.org/T148507) (owner: 10Marostegui) [07:03:30] PROBLEM - MegaRAID on es2013 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough [07:03:47] ^ checking [07:05:29] It started a re-learn cycle [07:05:39] I am going to disable auto-learn [07:09:01] !log Disable BBU auto-learn on es2013 [07:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:42] (03CR) 10Marostegui: [C: 032] db2057.yaml: Update db2057 socket location [puppet] - 10https://gerrit.wikimedia.org/r/369600 (https://phabricator.wikimedia.org/T148507) (owner: 10Marostegui) [07:11:58] (03CR) 10WMDE-leszek: [C: 031] Write to term_full_entity_id column in wb_terms table in prod too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369567 (https://phabricator.wikimedia.org/T167229) (owner: 10Ladsgroup) [07:13:00] (03CR) 10WMDE-leszek: [C: 031] Write to term_full_entity_id column in wb_terms table in prod too (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369567 (https://phabricator.wikimedia.org/T167229) (owner: 10Ladsgroup) [07:23:30] RECOVERY - 
MegaRAID on es2013 is OK: OK: optimal, 1 logical, 12 physical, WriteBack policy [07:34:12] 10Operations, 10monitoring, 10netops: "MySQL server has gone away" from librenms logs - https://phabricator.wikimedia.org/T171714#3492098 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi The move should have been completed by the 26th but indeed the errors are gone, tentatively resolving [07:40:40] (03PS1) 10Elukey: cdh:hive: set minimum jvm heap size to 4g [puppet] - 10https://gerrit.wikimedia.org/r/369601 [07:43:42] (03CR) 10Elukey: [C: 032] cdh:hive: set minimum jvm heap size to 4g [puppet] - 10https://gerrit.wikimedia.org/r/369601 (owner: 10Elukey) [07:46:31] (03CR) 10Volans: [C: 032] Style: fix some newly reported vulture violations [software/cumin] - 10https://gerrit.wikimedia.org/r/369384 (owner: 10Volans) [07:49:26] (03Merged) 10jenkins-bot: Style: fix some newly reported vulture violations [software/cumin] - 10https://gerrit.wikimedia.org/r/369384 (owner: 10Volans) [07:51:18] (03CR) 10Volans: [C: 032] Tests: simplify and improve parametrized tests [software/cumin] - 10https://gerrit.wikimedia.org/r/366733 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [07:52:56] (03Merged) 10jenkins-bot: Tests: simplify and improve parametrized tests [software/cumin] - 10https://gerrit.wikimedia.org/r/366733 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [08:01:39] (03CR) 10Volans: [C: 032] CLI: simplify imports and introspection [software/cumin] - 10https://gerrit.wikimedia.org/r/366734 (owner: 10Volans) [08:04:23] (03Merged) 10jenkins-bot: CLI: simplify imports and introspection [software/cumin] - 10https://gerrit.wikimedia.org/r/366734 (owner: 10Volans) [08:08:49] (03CR) 10Volans: [C: 032] Logging: add a custom trace() logging level [software/cumin] - 10https://gerrit.wikimedia.org/r/366735 (owner: 10Volans) [08:10:37] (03Merged) 10jenkins-bot: Logging: add a custom trace() logging level [software/cumin] - 10https://gerrit.wikimedia.org/r/366735 
(owner: 10Volans) [08:10:52] volans: good morning. I think we can bump tox to 2.5.0 now :-} https://gerrit.wikimedia.org/r/#/c/368616/ [08:11:16] volans: tried it out a bit yesterday and I caught a side effect this morning. But not much to worry about [08:11:18] and fixed [08:11:31] so I guess you can merge that patch and I will take care of refreshing the CI image :} [08:11:46] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Add db2073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369606 (https://phabricator.wikimedia.org/T170662) [08:12:27] hashar: sounds like a plan ;) [08:13:06] * volans running a puppet compiler JIC [08:14:50] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 276 bytes in 0.031 second response time [08:15:14] !log bounce pdfrender on scb1001, stuck [08:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:27] hashar: is that class applied on any prod host? (I don't think so) [08:20:33] (03PS1) 10Elukey: statistics::discovery: fix daily logrotate [puppet] - 10https://gerrit.wikimedia.org/r/369608 (https://phabricator.wikimedia.org/T132324) [08:21:50] (03CR) 10Gehel: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/369608 (https://phabricator.wikimedia.org/T132324) (owner: 10Elukey) [08:23:08] volans: labs only for sure :) [08:23:51] volans: in September we will probably extract most of the contint stuff out of puppet.git [08:24:29] we will probably start transitioning jobs to Docker images and thus handle the provisioning with docker shell command [08:24:30] s [08:24:53] :) [08:25:00] do you want me to merge it now? [08:25:33] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/7258/stat1005.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/369608 (https://phabricator.wikimedia.org/T132324) (owner: 10Elukey) [08:25:46] thanks gehel ! [08:25:54] elukey: np [08:28:00] volans: yes :-} [08:28:30] elukey: did you already merge yours? 
[08:28:58] (03PS2) 10Volans: contint: upgrade tox [puppet] - 10https://gerrit.wikimedia.org/r/368616 (https://phabricator.wikimedia.org/T169602) (owner: 10Hashar) [08:29:23] yep! [08:30:29] thx [08:30:31] (03CR) 10Volans: [C: 032] contint: upgrade tox [puppet] - 10https://gerrit.wikimedia.org/r/368616 (https://phabricator.wikimedia.org/T169602) (owner: 10Hashar) [08:31:34] hashar: merged! :) [08:31:41] neat thanks [08:32:12] !log Drop table click_tracking wherever it exists - T115982 [08:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:21] T115982: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982 [08:32:24] volans: did you have a patch blocked on tox being upgraded? [08:35:25] hashar: yes but I did it with a workaround that worked with the current version too, if only I could remember what was it I'll resend a patch with it :-P [08:35:46] volans: that was for cumin wasn't it ? [08:36:03] yep [08:37:18] * hashar TIL about python security scanner https://pypi.python.org/pypi/bandit [08:38:05] eheheh, that runs on cumin :-P [08:42:15] 10Operations, 10DBA, 10MediaWiki-extensions-ClickTracking: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982#3492207 (10Marostegui) `click_tracking` table has been dropped. 
[08:42:17] (03PS2) 10Ladsgroup: Write to term_full_entity_id column in wb_terms table in prod too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369567 (https://phabricator.wikimedia.org/T167229) [08:42:19] (03CR) 10Ladsgroup: Write to term_full_entity_id column in wb_terms table in prod too (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369567 (https://phabricator.wikimedia.org/T167229) (owner: 10Ladsgroup) [08:42:24] 10Operations, 10DBA, 10MediaWiki-extensions-ClickTracking: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982#3492208 (10Marostegui) [08:42:53] (03CR) 10WMDE-leszek: [C: 031] Write to term_full_entity_id column in wb_terms table in prod too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369567 (https://phabricator.wikimedia.org/T167229) (owner: 10Ladsgroup) [08:47:10] (03CR) 10Marostegui: [C: 032] db-eqiad,db-codfw.php: Add db2073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369606 (https://phabricator.wikimedia.org/T170662) (owner: 10Marostegui) [08:48:34] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Add db2073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369606 (https://phabricator.wikimedia.org/T170662) (owner: 10Marostegui) [08:48:43] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Add db2073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369606 (https://phabricator.wikimedia.org/T170662) (owner: 10Marostegui) [08:50:41] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Add db2073 to s4 - T170662 (duration: 00m 57s) [08:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:54] T170662: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662 [08:51:35] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add db2073 to s4 - T170662 (duration: 00m 47s) [08:51:45] (03PS3) 10Marostegui: Revert 
"db-codfw.php: Depool db2057" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369599 [08:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:35] (03PS1) 10Giuseppe Lavagetto: Switch to html representation of diffs [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/369610 [08:52:37] (03PS1) 10Giuseppe Lavagetto: Drop dead code to do diffing by shelling out. [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/369611 [08:53:19] (03PS2) 10Giuseppe Lavagetto: Drop dead code to do diffing by shelling out. [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/369611 [08:53:21] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/368623 (https://phabricator.wikimedia.org/T171704) (owner: 10Volans) [08:54:10] (03PS3) 10Giuseppe Lavagetto: Drop dead code to do diffing by shelling out. [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/369611 [08:54:35] volans: tox is upgraded and the jenkins job for puppet works properly :-} [08:54:58] hashar: \o/ [08:56:09] hashar: given we're on topic, I noticed that the console output of the tox CI job doesn't show the list of all installed dependencied with version on the virtualenvs [08:56:14] while I get that running tox locally [08:56:27] let's see if was the older version ;) [08:56:36] (03CR) 10Giuseppe Lavagetto: [C: 032] Switch to html representation of diffs [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/369610 (owner: 10Giuseppe Lavagetto) [08:56:57] using tox-1.9.2 f [08:56:58] grmblblbl [08:57:40] hashar: did you re-generate the image from scratch? 
the puppet code will not take care of the upgrade of the package [08:57:49] nop [08:57:51] just the pinning, if it's already installed it will not be touched [08:58:02] because ensure => present [08:58:10] and I forgot to get puppet to uninstall the tox version installed by pip under /usr/local/ [08:58:16] I will regenerate the image [08:58:38] ok [08:58:49] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2057" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369599 (owner: 10Marostegui) [09:01:05] 10Operations, 10Electron-PDFs, 10Patch-For-Review, 10Reading-Web-Backlog (Tracking), 10Services (blocked): pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922#3083419 (10fgiunchedi) This happened today at around 6:58 on scb1001, due to oomkill... [09:01:32] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2057" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369599 (owner: 10Marostegui) [09:01:41] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2057" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369599 (owner: 10Marostegui) [09:02:32] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2057 - T170662 (duration: 00m 46s) [09:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:48] T170662: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662 [09:03:03] 10Operations, 10monitoring: diamond nutcracker collector failing and spamming on scb - https://phabricator.wikimedia.org/T172254#3492244 (10fgiunchedi) [09:03:36] !log Drop table click_tracking_user_properties wherever it exists - T115982 [09:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:48] T115982: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982 [09:05:51] (03CR) 10Gehel: [C: 
031] "LGTM" [software/cumin] - 10https://gerrit.wikimedia.org/r/366736 (https://phabricator.wikimedia.org/T170394) (owner: 10Volans) [09:09:10] (03CR) 10Giuseppe Lavagetto: [C: 032] Drop dead code to do diffing by shelling out. [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/369611 (owner: 10Giuseppe Lavagetto) [09:12:33] (03PS1) 10Giuseppe Lavagetto: puppet-compiler: bump version, remove catalog diff module install [puppet] - 10https://gerrit.wikimedia.org/r/369614 [09:13:50] 10Operations, 10monitoring: diamond nutcracker collector failing and spamming on scb - https://phabricator.wikimedia.org/T172254#3492315 (10fgiunchedi) The culprit seems to be this in the nutcracker config, the name is emitted unquoted in the json output: ``` servers: - 10.64.32.76:6382:1 "cp-1" - 1... [09:13:54] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet-compiler: bump version, remove catalog diff module install [puppet] - 10https://gerrit.wikimedia.org/r/369614 (owner: 10Giuseppe Lavagetto) [09:14:58] (03PS1) 10Hashar: openstack: sync libvirtd.conf from Jessie package [puppet] - 10https://gerrit.wikimedia.org/r/369615 [09:16:18] (03CR) 10Aklapper: "The last comment makes no sense because that "we" does not exist." [puppet] - 10https://gerrit.wikimedia.org/r/368775 (owner: 10Aklapper) [09:25:07] Hi do you have a paste bin? [09:25:15] (03PS2) 10Hashar: openstack: libvirtd.conf from Jessie package [1/2] [puppet] - 10https://gerrit.wikimedia.org/r/369615 [09:25:17] (03PS1) 10Hashar: openstack: libvirtd.conf from Jessie package [2/2] [puppet] - 10https://gerrit.wikimedia.org/r/369617 [09:26:08] (03CR) 10Hashar: "Made that change to ease diff with the stock configuration from Debian Jessie. This change only update typos/white spaces." 
[puppet] - 10https://gerrit.wikimedia.org/r/369615 (owner: 10Hashar) [09:26:33] ShakespeareFan00: https://phabricator.wikimedia.org/paste/ or the many that you can find online [09:27:21] (03CR) 10Hashar: "Made that change to ease diff with the stock configuration from Debian Jessie. This change add the new settings, though they are all comme" [puppet] - 10https://gerrit.wikimedia.org/r/369617 (owner: 10Hashar) [09:28:02] Was STILL seeing "dial-up" level performance :( [09:28:16] (03CR) 10Volans: [C: 032] Transports: convert hosts to ClusterShell's NodeSet [software/cumin] - 10https://gerrit.wikimedia.org/r/366736 (https://phabricator.wikimedia.org/T170394) (owner: 10Volans) [09:29:44] (03Merged) 10jenkins-bot: Transports: convert hosts to ClusterShell's NodeSet [software/cumin] - 10https://gerrit.wikimedia.org/r/366736 (https://phabricator.wikimedia.org/T170394) (owner: 10Volans) [09:30:01] (03PS10) 10DCausse: Switch this repo to a deb package [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/352170 (https://phabricator.wikimedia.org/T158560) [09:31:32] 10Operations, 10ops-eqiad, 10Traffic: Degraded RAID on cp1008 - https://phabricator.wikimedia.org/T171028#3492337 (10ema) @Cmjohnson any news? [09:35:45] https://phabricator.wikimedia.org/P5841 [09:36:15] Perhaps someone can work through that trace output and figure out why I'm only getting dial-up performance? [09:40:44] 10Operations, 10DBA, 10MediaWiki-extensions-ClickTracking: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982#3492346 (10Marostegui) 05Open>03Resolved a:03Marostegui `click_tracking_user_pro... 
[09:40:48] 10Operations, 10DBA, 10MediaWiki-extensions-ClickTracking: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982#3492350 (10Marostegui) [09:46:40] (03PS1) 10Hashar: contint: Package[tox] -> Package[python-tox] [puppet] - 10https://gerrit.wikimedia.org/r/369620 (https://phabricator.wikimedia.org/T169602) [09:46:55] volans: and of course I screwed up my patch: https://gerrit.wikimedia.org/r/369620 contint: Package[tox] -> Package[python-tox] [09:46:56] :( [09:49:31] (03PS1) 10Ema: cache: nginx-lua-prometheus testing on a text node [puppet] - 10https://gerrit.wikimedia.org/r/369621 [09:54:21] hashar: oh right, sorry missed that too [09:54:37] (03CR) 10Volans: [C: 032] contint: Package[tox] -> Package[python-tox] [puppet] - 10https://gerrit.wikimedia.org/r/369620 (https://phabricator.wikimedia.org/T169602) (owner: 10Hashar) [09:55:54] hashar: merged! [09:58:25] (03CR) 10Filippo Giunchedi: [C: 031] cache: nginx-lua-prometheus testing on a text node [puppet] - 10https://gerrit.wikimedia.org/r/369621 (owner: 10Ema) [09:58:38] (03PS1) 10Giuseppe Lavagetto: Fix content diff newlines [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/369622 [09:59:18] (03CR) 10Ema: [C: 032] cache: nginx-lua-prometheus testing on a text node [puppet] - 10https://gerrit.wikimedia.org/r/369621 (owner: 10Ema) [09:59:25] (03PS2) 10Ema: cache: nginx-lua-prometheus testing on a text node [puppet] - 10https://gerrit.wikimedia.org/r/369621 [09:59:27] (03CR) 10Ema: [V: 032 C: 032] cache: nginx-lua-prometheus testing on a text node [puppet] - 10https://gerrit.wikimedia.org/r/369621 (owner: 10Ema) [09:59:53] Also when did the new UI start to roll out? 
[10:00:06] I didn't have connection issues before the new UI [10:00:21] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix content diff newlines [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/369622 (owner: 10Giuseppe Lavagetto) [10:02:16] volans: thanks. Rebuilding stuff from scratch again :) [10:02:36] :) [10:04:18] 10Operations, 10monitoring: diamond nutcracker collector failing and spamming on scb - https://phabricator.wikimedia.org/T172254#3492379 (10fgiunchedi) Findings and ideas so far with @Joe and @Volans: * we can't change the names without suffering key redistribution (the whole name is used for ketama) * the na... [10:04:35] 10Operations, 10monitoring: Double quotes in nutcracker config make json stats invalid json - https://phabricator.wikimedia.org/T172254#3492381 (10fgiunchedi) [10:15:39] (03CR) 10Giuseppe Lavagetto: [C: 031] ipresolve: update documentation [puppet] - 10https://gerrit.wikimedia.org/r/367871 (owner: 10Ema) [10:16:50] 10Operations, 10monitoring: Double quotes in nutcracker config make json stats invalid json - https://phabricator.wikimedia.org/T172254#3492394 (10fgiunchedi) Reported upstream at https://github.com/twitter/twemproxy/issues/532 Production mw isn't currently affected because this bug was patched on our side in... [10:18:17] !log Rename wikigrok tables on enwiki on db1089 - T172020 [10:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:30] T172020: Drop WikiGrok tables from production - https://phabricator.wikimedia.org/T172020 [10:18:59] (03PS2) 10Ema: ipresolve: update documentation [puppet] - 10https://gerrit.wikimedia.org/r/367871 [10:20:30] (03CR) 10Ema: [V: 032 C: 032] ipresolve: update documentation [puppet] - 10https://gerrit.wikimedia.org/r/367871 (owner: 10Ema) [10:20:35] marostegui: do we rename and then drop or just rename? 
(i'm following the wikigrok tables task and am curious) [10:21:52] phuedx: We normally rename the tables, leave them a few days to make sure nothing breaks and no one screams, and if so, we drop them [10:22:05] cool, thanks! [10:22:24] We also back them up and keep the backups for a few weeks before removing them [10:23:31] marostegui: Hi [10:23:45] STILL getting very slow loading times on pages, which I'm not getting on non WMF sites [10:25:09] ShakespeareFan00: You might want to follow this: https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue so we can get more details [10:25:17] I did [10:25:21] It's in a pastebin [10:25:25] I linked earlier [10:25:38] https://phabricator.wikimedia.org/P5841 [10:25:59] Did you create the phabricator task as that procedure suggests? so it is easier for everyone to follow and discuss? [10:26:43] I hadn't [10:27:02] ShakespeareFan00: I would suggest you do, as it is easier for discussions [10:27:02] because I felt it was reasonable to try and find some to resolve it without having to file [10:27:46] Phabricator is of course back at 'dial-up' speed :( [10:29:43] (03PS1) 10Marostegui: db-codfw.php: Depool db2051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369626 (https://phabricator.wikimedia.org/T170351) [10:32:16] ShakespeareFan00: last time I had the issue it was because Phabricator CSS ships some font files that take a while to download [10:32:44] hashar: That doesn't explain why the rest of the WMF sites are slow [10:32:51] ah [10:33:09] If you are doing 'web-fonts', they should be locally cached... 
[10:33:11] based on the traceroute, it looks like your connection is too busy [10:33:21] yeah the web-fonts is what I had in mind, and indeed that is cached locally [10:33:31] PROBLEM - MegaRAID on es2013 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough [10:33:34] in the paste above, the traceroute shows high latency with the 2nd op [10:33:37] 2nd hop [10:33:55] So it's not my local configuration? [10:34:19] the 2nd hop being 194.159.169.245 which I locally reach in 15 ms round trip [10:34:38] Checking es2013 [10:34:52] strange [10:34:59] It happened in the morning [10:35:02] that is a relatively new one [10:35:08] the relearn was active [10:35:08] ShakespeareFan00: and I reach it in 76 milliseconds round trip from a Wikimedia server in the US [10:35:10] I disabled it [10:35:25] ah, then we should check it on all es hosts [10:35:37] ShakespeareFan00: so most probably there is something on your local installation that is consuming too much bandwidth, or your link has degraded to low bandwidth mode [10:35:58] or maybe it was one of those where parts were changed/BIOS defaults [10:36:08] There shouldn't be anything other than web... [10:36:12] Yeah, I was checking if there is a task for that host [10:36:19] And I'm well within capacity limits [10:36:19] But it looks like a faulty BBU right now [10:36:27] BBU? [10:36:30] no need, let's disable auto learn everywhere [10:36:40] You mean a local exchange issue? [10:36:53] Or was that in relation to something else? 
[10:36:56] ShakespeareFan00: jynus and myself are troubleshooting another issue, not your case :) [10:36:58] ShakespeareFan00: my comments are for marostegui and had nothing to do with your network issues [10:37:04] Thanks [10:37:21] hashar: I've checked locally [10:37:26] jynus: I meant that I disabled the auto-learn on es2013 this morning because it was enabled, but now it looks like the BBU is not happy [10:37:38] There shouldn't be anything (of which I am aware) that should be using a lot of bandwidth [10:37:52] And I have paranoid level monitoring [10:38:16] ShakespeareFan00: maybe you can connect to your ISP local router and it would offer some statistics ? :D [10:38:55] hashar: In the past I've not noted anything obvious and I haven't changed the configuration recently [10:39:13] jynus: I am going to force a relearn to see if it gets fixed and, if it fails again (in days or weeks…), we can actually believe it is a faulty BBU [10:40:29] those should be under warranty, so we can open a case and forget [10:40:46] as long as we have logs to back up the fault [10:40:49] Let me confirm the warranty [10:45:08] I am seeing in the logs that it has been flapping a lot from operating normally to battery low [10:45:42] And they are under warranty indeed [10:47:26] (03CR) 10DCausse: [C: 04-1] "small typo in array key that prevents ltr-i-20 from working properly" (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369594 (https://phabricator.wikimedia.org/T171212) (owner: 10EBernhardson) [10:48:20] PROBLEM - cassandra-a SSL 10.64.0.167:7001 on restbase-dev1004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [10:48:21] PROBLEM - cassandra-a CQL 10.64.0.167:9042 on restbase-dev1004 is CRITICAL: connect to address 10.64.0.167 and port 9042: Connection refused [10:48:40] PROBLEM - cassandra-a service on restbase-dev1004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [10:48:40] PROBLEM - Check 
systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:48:58] https://phabricator.wikimedia.org/T172262 [10:49:12] marostegui: Failing equipment? ;) [10:49:55] !log Force a re-learn cycle on es2013 [10:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:55] 10Operations, 10Traffic, 10netops: Poor conectivity (Vodafone/THUS in UK) - https://phabricator.wikimedia.org/T172262#3492515 (10Marostegui) [10:51:08] (03PS1) 10Filippo Giunchedi: monitor: fix nutcracker diamond collector double quote escaping [puppet] - 10https://gerrit.wikimedia.org/r/369628 (https://phabricator.wikimedia.org/T172254) [10:51:13] "Event Description: Battery relearn started / Event Description: Battery relearn in progress / Event Description: Battery relearn completed" [10:51:35] last one is "Current capacity of the battery is below threshold" [10:51:40] yes, I saw that in the logs [10:51:44] after the relearn finished [10:51:51] I think that is enough to open a ticket [10:52:00] I forced another relearn, to see if it discharges again quickly [10:52:08] (agreed on the ticket - I was creating it already) [10:52:13] it doesn't matter, to me that is faulty [10:52:21] I would just ask for a replacement [10:53:05] I will depool it [10:53:17] cool thanks [10:54:40] RECOVERY - cassandra-a service on restbase-dev1004 is OK: OK - cassandra-a is active [10:54:40] RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational [10:54:50] tell me when created- I will reference it on the commit [10:55:20] yep, will be done in a sec [10:56:30] RECOVERY - cassandra-a CQL 10.64.0.167:9042 on restbase-dev1004 is OK: TCP OK - 0.000 second response time on 10.64.0.167 port 9042 [10:56:30] RECOVERY - cassandra-a SSL 10.64.0.167:7001 on restbase-dev1004 is OK: SSL OK - Certificate restbase-dev1004-a valid until 2018-07-20 15:08:04 +0000 (expires in 352 days) 
[10:57:59] are you ok with me disabling auto learn on the other hosts? [10:58:11] 10Operations, 10ops-codfw, 10DBA: es2013 faulty BBU - https://phabricator.wikimedia.org/T172265#3492537 (10Marostegui) [10:58:19] jynus: yeah, go for it and also the task ^ [10:58:23] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/368623 (https://phabricator.wikimedia.org/T171704) (owner: 10Volans) [10:58:49] thank you very much, marostegui [10:59:07] Feel free to add more stuff or more details or whatever you think it might be helpful :) [10:59:08] volans: using tox-2.5.0 :) [10:59:09] (03CR) 10jerkins-bot: [V: 04-1] Failoid: migrate to Puppet's future parser [puppet] - 10https://gerrit.wikimedia.org/r/368623 (https://phabricator.wikimedia.org/T171704) (owner: 10Volans) [10:59:32] then apparently that breaks the job on puppet.git :( [11:01:01] marostegui: one thing we can do is to schedule an autolearn to all of codfw and on eqiad on the next dc depool [11:01:13] (03CR) 10Giuseppe Lavagetto: [C: 031] Query: add multi-query support [software/cumin] - 10https://gerrit.wikimedia.org/r/366737 (https://phabricator.wikimedia.org/T170394) (owner: 10Volans) [11:01:36] That would be a good idea indeed, I am sure we will get more BBUs failing after forcing it too [11:02:10] yes [11:02:16] !log Upgraded tox on CI to 2.5.0 [11:02:17] better now that in 2 years [11:02:24] Correct! [11:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:30] RECOVERY - MegaRAID on es2013 is OK: OK: optimal, 1 logical, 12 physical, WriteBack policy [11:03:41] We will see how long it takes to fail again :) [11:04:43] 10Operations, 10ops-codfw, 10DBA: es2013 faulty BBU - https://phabricator.wikimedia.org/T172265#3492564 (10Marostegui) For the record, after manually forcing the re-learn we got it back to healthy - let's see how long until it fails again: ``` ˜/icinga-wm 13:03> RECOVERY - MegaRAID on es2013 is OK: OK: optim... 
[11:05:10] (03PS1) 10Jcrespo: mariadb: Depool db2013 for BBU issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369629 (https://phabricator.wikimedia.org/T172265) [11:05:21] GRMBBM [11:05:25] tox default to python3 :( [11:06:25] (03CR) 10Marostegui: [C: 031] mariadb: Depool db2013 for BBU issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369629 (https://phabricator.wikimedia.org/T172265) (owner: 10Jcrespo) [11:06:51] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db2013 for BBU issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369629 (https://phabricator.wikimedia.org/T172265) (owner: 10Jcrespo) [11:08:23] (03Merged) 10jenkins-bot: mariadb: Depool db2013 for BBU issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369629 (https://phabricator.wikimedia.org/T172265) (owner: 10Jcrespo) [11:08:34] (03CR) 10jenkins-bot: mariadb: Depool db2013 for BBU issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369629 (https://phabricator.wikimedia.org/T172265) (owner: 10Jcrespo) [11:11:49] (03CR) 10Marostegui: [C: 031] "Nice job!! :-)" [puppet] - 10https://gerrit.wikimedia.org/r/369397 (https://phabricator.wikimedia.org/T171928) (owner: 10Jcrespo) [11:12:19] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool es2013 (duration: 00m 47s) [11:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:07] (03PS2) 10Marostegui: db-codfw.php: Depool db2051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369626 (https://phabricator.wikimedia.org/T170351) [11:16:40] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369626 (https://phabricator.wikimedia.org/T170351) (owner: 10Marostegui) [11:16:58] volans: you would never trust me. 
the python-tox debian package is broken :/ [11:17:14] !log kartik@tin Started deploy [cxserver/deploy@cf3e280]: Update cxserver to fe03ad7 [11:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:11] !log kartik@tin Finished deploy [cxserver/deploy@cf3e280]: Update cxserver to fe03ad7 (duration: 00m 56s) [11:18:12] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369626 (https://phabricator.wikimedia.org/T170351) (owner: 10Marostegui) [11:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:21] (03CR) 10jenkins-bot: db-codfw.php: Depool db2051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369626 (https://phabricator.wikimedia.org/T170351) (owner: 10Marostegui) [11:19:37] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2051 - T170351 (duration: 00m 46s) [11:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:47] T170351: db2019 has performance issues, replace disk or switchover s4 master elsewhere - https://phabricator.wikimedia.org/T170351 [11:20:54] (03PS1) 10Hashar: contint: revert tox installation from Debian package [puppet] - 10https://gerrit.wikimedia.org/r/369632 (https://phabricator.wikimedia.org/T169602) [11:21:21] (03PS1) 10Marostegui: db2051.yaml: Update its socket location [puppet] - 10https://gerrit.wikimedia.org/r/369633 (https://phabricator.wikimedia.org/T170351) [11:21:23] interestingly, on es1, some hosts have auto learn disabled and others don't [11:21:26] mobrovac: Can you check error at: https://pastebin.com/wnZxe1Kn [11:21:39] mobrovac: cxserver deployment failed. 
[11:21:44] (03CR) 10jerkins-bot: [V: 04-1] contint: revert tox installation from Debian package [puppet] - 10https://gerrit.wikimedia.org/r/369632 (https://phabricator.wikimedia.org/T169602) (owner: 10Hashar) [11:21:54] !log kartik@tin (no justification provided) [11:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:09] (03CR) 10jerkins-bot: [V: 04-1] db2051.yaml: Update its socket location [puppet] - 10https://gerrit.wikimedia.org/r/369633 (https://phabricator.wikimedia.org/T170351) (owner: 10Marostegui) [11:22:16] !log kartik@tin (no justification provided) [11:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:25] (03PS1) 10Hashar: contint: upgrade tox to 2.5.0 using pip [puppet] - 10https://gerrit.wikimedia.org/r/369634 (https://phabricator.wikimedia.org/T169602) [11:23:01] hashar: Is that -1 I got from jenkins related to your ongoing work? [11:23:03] so python/tox is broken (it defaults to python3), it would need a revert and reinstall via: https://gerrit.wikimedia.org/r/369632 https://gerrit.wikimedia.org/r/369634 [11:23:06] yeah :( [11:23:12] ok! [11:23:16] python-tox.deb has set a trap :( [11:23:20] (03CR) 10jerkins-bot: [V: 04-1] contint: upgrade tox to 2.5.0 using pip [puppet] - 10https://gerrit.wikimedia.org/r/369634 (https://phabricator.wikimedia.org/T169602) (owner: 10Hashar) [11:23:39] if you don't mind merging the two patches I have sent, that would let me regenerate the image with a proper tox version [11:23:55] (03CR) 10Marostegui: "The -1 is related to @hashar on-going work" [puppet] - 10https://gerrit.wikimedia.org/r/369633 (https://phabricator.wikimedia.org/T170351) (owner: 10Marostegui) [11:24:00] hashar: which ones? 
[11:24:22] marostegui: https://gerrit.wikimedia.org/r/369632 https://gerrit.wikimedia.org/r/369634 :) [11:24:30] the first is a revert of a couple of patches I made earlier [11:24:40] the latter bumps the version installed via pip [11:24:58] (because obviously I haven't tested the deb package) [11:25:00] XD [11:25:03] ok, let me merge [11:25:25] would need to remove jenkins-bot from the list of reviewers [11:25:43] (03CR) 10Marostegui: [C: 032] contint: revert tox installation from Debian package [puppet] - 10https://gerrit.wikimedia.org/r/369632 (https://phabricator.wikimedia.org/T169602) (owner: 10Hashar) [11:25:54] (03CR) 10Marostegui: [V: 032 C: 032] contint: revert tox installation from Debian package [puppet] - 10https://gerrit.wikimedia.org/r/369632 (https://phabricator.wikimedia.org/T169602) (owner: 10Hashar) [11:26:24] (03CR) 10Marostegui: [V: 032 C: 032] contint: upgrade tox to 2.5.0 using pip [puppet] - 10https://gerrit.wikimedia.org/r/369634 (https://phabricator.wikimedia.org/T169602) (owner: 10Hashar) [11:27:14] hashar: both deployed [11:27:16] and regenerating the image [11:27:16] :( [11:27:18] thank you marostegui! [11:28:09] (03CR) 10Jcrespo: [C: 031] "Remember to restart it again after setting it as master." 
[puppet] - 10https://gerrit.wikimedia.org/r/369633 (https://phabricator.wikimedia.org/T170351) (owner: 10Marostegui) [11:29:17] (03CR) 10Marostegui: "Yeah, it will be moved to another rack and let it replicate normally until we decide the day to do the switchover, and then we can do all " [puppet] - 10https://gerrit.wikimedia.org/r/369633 (https://phabricator.wikimedia.org/T170351) (owner: 10Marostegui) [11:29:30] !log setting all es hosts with disabled auto-learn bbu properties [11:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:38] Executing '/usr/local/bin/pip install -q tox==2.5.0' [11:31:29] hashar: I will do a recheck on my patch once you are done [11:33:43] !log Stop MySQL on db2051 for maintenance - T170351 [11:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:55] T170351: db2019 has performance issues, replace disk or switchover s4 master elsewhere - https://phabricator.wikimedia.org/T170351 [11:37:38] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/369633 (https://phabricator.wikimedia.org/T170351) (owner: 10Marostegui) [11:38:08] using tox-2.5.0 from /usr/local/lib/python2.7/dist-packages/tox/__init__.pyc [11:38:28] marostegui: that is fixed :-} Thank you very much for the quick merges! [11:38:52] hashar: oh thank you! 
:) [11:39:03] (03PS2) 10Marostegui: db2051.yaml: Update its socket location [puppet] - 10https://gerrit.wikimedia.org/r/369633 (https://phabricator.wikimedia.org/T170351) [11:39:06] !log disabling autolearn on newest db2* hosts [11:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:16] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/368623 (https://phabricator.wikimedia.org/T171704) (owner: 10Volans) [11:40:49] (03CR) 10Marostegui: [C: 032] db2051.yaml: Update its socket location [puppet] - 10https://gerrit.wikimedia.org/r/369633 (https://phabricator.wikimedia.org/T170351) (owner: 10Marostegui) [11:41:47] (03CR) 10Hashar: "Sorry the job failed due to a faulty upgrade of tox on CI and I have been using this patch as a test bed." [puppet] - 10https://gerrit.wikimedia.org/r/368623 (https://phabricator.wikimedia.org/T171704) (owner: 10Volans) [11:42:16] !log disabling autolearn on newest db1* hosts [11:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:08] (03PS2) 10Filippo Giunchedi: monitor: fix nutcracker diamond collector double quote escaping [puppet] - 10https://gerrit.wikimedia.org/r/369628 (https://phabricator.wikimedia.org/T172254) [11:53:10] (03PS1) 10Filippo Giunchedi: profile: fix swift udevadm reload name [puppet] - 10https://gerrit.wikimedia.org/r/369636 (https://phabricator.wikimedia.org/T171454) [11:53:45] (03PS2) 10Filippo Giunchedi: profile: fix swift udevadm reload name [puppet] - 10https://gerrit.wikimedia.org/r/369636 (https://phabricator.wikimedia.org/T171454) [11:57:25] 10Operations, 10fundraising-tech-ops, 10netops: bonded/redundant network connections for fundraising hosts - https://phabricator.wikimedia.org/T171962#3492728 (10mark) No objections from me. It does add complexity somewhat and will probably add some failure modes where both ports fail to come up (LACP failin... 
[12:04:03] 10Operations, 10Traffic, 10netops: Poor conectivity (Vodafone/THUS in UK) - https://phabricator.wikimedia.org/T172262#3492487 (10mark) The paste (traceroute) you provided already shows exceptionally high (1s+) latency to your first hop (either your home router or the first ISP router), so this doesn't look l... [12:06:52] hashar: feel free to use my patch as test bed, no need to apologize ;) [12:07:40] (03CR) 10Volans: "recheck" [software/cumin] - 10https://gerrit.wikimedia.org/r/366737 (https://phabricator.wikimedia.org/T170394) (owner: 10Volans) [12:18:01] hashar: btw the debian backport package is tox, python-tox is only a transitional dummy package for tox [12:20:04] (03PS5) 10Mark Bergsma: Add some protocol BGP class test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355415 [12:20:06] (03PS4) 10Mark Bergsma: Add bgp.ip unit test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355425 [12:20:08] (03PS4) 10Mark Bergsma: Add basic unit tests for protocol BGP send methods [debs/pybal] - 10https://gerrit.wikimedia.org/r/355445 [12:20:10] (03PS3) 10Mark Bergsma: Add BGP.parseOpen unit test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355795 [12:20:12] (03PS3) 10Mark Bergsma: Fix IPPrefix value comparisons with different packed paddings [debs/pybal] - 10https://gerrit.wikimedia.org/r/356611 [12:20:14] (03PS3) 10Mark Bergsma: Add basic BGP.parseUpdate test case [debs/pybal] - 10https://gerrit.wikimedia.org/r/356612 [12:20:16] (03PS2) 10Mark Bergsma: Add BGP.parse{KeepAlive,Notification} test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/356620 [12:21:29] (03PS1) 10Ema: VCL: VSV00001 DoS workaround (DSA 3924-1) [puppet] - 10https://gerrit.wikimedia.org/r/369638 [12:21:48] jouncebot: refresh [12:21:50] I refreshed my knowledge about deployments. 
[12:21:52] jouncebot: next [12:21:53] In 0 hour(s) and 38 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170802T1300) [12:22:16] (03CR) 10jerkins-bot: [V: 04-1] VCL: VSV00001 DoS workaround (DSA 3924-1) [puppet] - 10https://gerrit.wikimedia.org/r/369638 (owner: 10Ema) [12:26:03] (03PS2) 10Ema: VCL: VSV00001 DoS workaround (DSA 3924-1) [puppet] - 10https://gerrit.wikimedia.org/r/369638 [12:26:30] (03CR) 10Hashar: [C: 031] Write to term_full_entity_id column in wb_terms table in prod too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369567 (https://phabricator.wikimedia.org/T167229) (owner: 10Ladsgroup) [12:26:33] (03CR) 10Volans: [C: 031] "LGTM according to https://varnish-cache.org/security/VSV00001.html" [puppet] - 10https://gerrit.wikimedia.org/r/369638 (owner: 10Ema) [12:26:50] (03CR) 10jerkins-bot: [V: 04-1] VCL: VSV00001 DoS workaround (DSA 3924-1) [puppet] - 10https://gerrit.wikimedia.org/r/369638 (owner: 10Ema) [12:29:34] (03PS3) 10Ema: VCL: VSV00001 DoS workaround (DSA 3924-1) [puppet] - 10https://gerrit.wikimedia.org/r/369638 [12:35:02] * elukey requests a GIF from godog for inline C in VCL code --^ [12:35:14] lol [12:35:32] (03CR) 10Ema: [C: 032] VCL: VSV00001 DoS workaround (DSA 3924-1) [puppet] - 10https://gerrit.wikimedia.org/r/369638 (owner: 10Ema) [12:35:44] https://i.imgur.com/NNLfNJa.mp4 [12:35:47] elukey: ^ [12:35:58] rotfl [12:36:19] ahahahaha [12:37:03] heheh apt in other occasions too [12:37:16] you're truly a gif-master! 
:-P [12:37:23] Lol [12:37:57] haha I have a shash [12:38:03] stash even [12:38:13] jouncebot: next [12:38:13] In 0 hour(s) and 21 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170802T1300) [12:41:16] !log banning elastic102[45] - T168816 [12:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:29] T168816: some elasticsearch servers in eqiad have CPU overheating - https://phabricator.wikimedia.org/T168816 [12:43:06] (03CR) 10Daniel Kinzler: [C: 031] Write to term_full_entity_id column in wb_terms table in prod too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369567 (https://phabricator.wikimedia.org/T167229) (owner: 10Ladsgroup) [12:43:20] (03CR) 10Filippo Giunchedi: [C: 032] profile: fix swift udevadm reload name [puppet] - 10https://gerrit.wikimedia.org/r/369636 (https://phabricator.wikimedia.org/T171454) (owner: 10Filippo Giunchedi) [12:43:23] (03CR) 10Hashar: [C: 031] profile: fix swift udevadm reload name [puppet] - 10https://gerrit.wikimedia.org/r/369636 (https://phabricator.wikimedia.org/T171454) (owner: 10Filippo Giunchedi) [12:43:26] (03PS3) 10Filippo Giunchedi: profile: fix swift udevadm reload name [puppet] - 10https://gerrit.wikimedia.org/r/369636 (https://phabricator.wikimedia.org/T171454) [12:46:40] (03PS2) 10Filippo Giunchedi: install_server: default to ext4 [puppet] - 10https://gerrit.wikimedia.org/r/369003 (https://phabricator.wikimedia.org/T169605) [12:49:07] (03CR) 10Filippo Giunchedi: [C: 032] install_server: default to ext4 [puppet] - 10https://gerrit.wikimedia.org/r/369003 (https://phabricator.wikimedia.org/T169605) (owner: 10Filippo Giunchedi) [12:52:37] 10Operations, 10Patch-For-Review: Standardizing our partman recipes - https://phabricator.wikimedia.org/T156955#3493041 (10fgiunchedi) [12:52:39] 10Operations, 10Patch-For-Review, 10User-fgiunchedi: Default to ext4 instead of ext3 - https://phabricator.wikimedia.org/T169605#3493039 
(10fgiunchedi) 05Open>03Resolved a:03fgiunchedi [12:55:00] (03CR) 10Gehel: "The configuration as it is does not seem to work. It tags the message with "_grokparsefailure". It does mutate "type" to "wdqs", but does " [puppet] - 10https://gerrit.wikimedia.org/r/299825 (owner: 10BryanDavis) [12:55:14] (03CR) 10Gehel: [C: 04-1] logstash: Parse nginx access logs for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/299825 (owner: 10BryanDavis) [12:57:03] hashar: should I do the eu swat today, or do you want to? [12:58:10] Hi [12:58:49] hi [12:58:52] (03CR) 10Giuseppe Lavagetto: rake: new rakefile specifically for CI (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/366591 (https://phabricator.wikimedia.org/T166888) (owner: 10Giuseppe Lavagetto) [12:59:21] (03PS7) 10Giuseppe Lavagetto: rake: new rakefile specifically for CI [puppet] - 10https://gerrit.wikimedia.org/r/366591 (https://phabricator.wikimedia.org/T166888) [12:59:25] zeljkof: I will handle it :) [12:59:37] (03PS1) 10Ema: 4.1.8-1wm1: new upstream, fixes VSV00001 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/369643 [12:59:38] !log banning elastic102[67] - T168816 [12:59:39] hashar: thanks [12:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:49] T168816: some elasticsearch servers in eqiad have CPU overheating - https://phabricator.wikimedia.org/T168816 [12:59:57] (03CR) 10jerkins-bot: [V: 04-1] 4.1.8-1wm1: new upstream, fixes VSV00001 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/369643 (owner: 10Ema) [12:59:59] I'm around on phone [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170802T1300). Please do the needful. 
[13:00:05] Amir1 and aude: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:13] audephone: the wikidata patch is already merged :) [13:00:17] Will be on pc in one or two minutes [13:00:20] Thanks [13:00:36] It's for test wikidata only at this point [13:01:50] it is on mwdebug1001 [13:01:55] but I can just sync it if it is just for testwikidatawiki [13:02:03] Yeah [13:02:23] (03PS3) 10Hashar: Write to term_full_entity_id column in wb_terms table in prod too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369567 (https://phabricator.wikimedia.org/T167229) (owner: 10Ladsgroup) [13:03:26] I'm around now [13:05:11] audephone: syncing it :) [13:05:15] Thanks [13:05:42] I can check test wikidata is not broken [13:06:15] Amir1: pushing to mwdebug1001 in a short [13:06:24] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369567 (https://phabricator.wikimedia.org/T167229) (owner: 10Ladsgroup) [13:06:42] Thanks [13:07:18] !log hashar@tin Synchronized php-1.30.0-wmf.12/extensions/Wikidata: fix constraint type checks - T169326 (duration: 02m 13s) [13:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:28] T169326: Type constraints still report false violations - https://phabricator.wikimedia.org/T169326 [13:07:29] Checking [13:07:37] audephone: done. Thank you for flying with SWAT! 
[13:07:55] (03CR) 10Gehel: [C: 04-1] "My above comment was missing some context:" [puppet] - 10https://gerrit.wikimedia.org/r/299825 (owner: 10BryanDavis) [13:07:56] Looks good [13:08:09] Thanks [13:08:10] (03Merged) 10jenkins-bot: Write to term_full_entity_id column in wb_terms table in prod too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369567 (https://phabricator.wikimedia.org/T167229) (owner: 10Ladsgroup) [13:08:16] \O/ [13:08:21] (03CR) 10jenkins-bot: Write to term_full_entity_id column in wb_terms table in prod too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369567 (https://phabricator.wikimedia.org/T167229) (owner: 10Ladsgroup) [13:08:53] Amir1: it is on mwdebug1001 [13:09:02] okay, on it [13:09:04] Amir1: I guess you want to generate some magic that attempt to write tot he column ? [13:09:11] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 30 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1790945/#!map [13:09:34] hashar: practically I just want to make sure it is not exploding [13:09:46] Hashar I think editing any item would have the column populate [13:09:49] since we are not reading from the new column it doesn't really matter now [13:09:56] We already have on test wikidata [13:10:27] 10Operations: wikitech-static.wikimedia.org certificate renewal (expiring 2017-08-09) - https://phabricator.wikimedia.org/T172285#3493146 (10herron) [13:14:10] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 1 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1790945/#!map [13:14:29] hashar: works fine [13:15:00] \o/ [13:15:16] \O/ [13:15:18] syncing it [13:16:31] !log hashar@tin Synchronized wmf-config/Wikibase-production.php: Write to term_full_entity_id column in wb_terms table in prod too - T167229 (duration: 00m 46s) [13:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:41] T167229: Change configuration of Wikidata in 
production to write term_full_entity_id - https://phabricator.wikimedia.org/T167229 [13:20:05] welcome in production, the temperature outside is 29°C, fans are producing wind accordingly and the service is [UP, UP, UP, UP, UP, UP]. I wish your patch a pleasant stay! [13:20:11] audephone: Amir1 looks good on my side [13:20:39] :)))))))) [13:20:54] I tested it in several places and works just fine [13:21:14] will monitor grafana so it doesn't push anything on the server [13:23:19] (03PS2) 10Ema: 4.1.8-1wm1: new upstream, fixes VSV00001 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/369643 [13:26:40] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 2001916 [13:27:47] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/369628 (https://phabricator.wikimedia.org/T172254) (owner: 10Filippo Giunchedi) [13:30:50] PROBLEM - Apache HTTP on mw1208 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time [13:30:52] PROBLEM - Nginx local proxy to apache on mw1208 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.009 second response time [13:30:55] (03CR) 10Volans: [C: 032] Query: add multi-query support [software/cumin] - 10https://gerrit.wikimedia.org/r/366737 (https://phabricator.wikimedia.org/T170394) (owner: 10Volans) [13:31:51] RECOVERY - Apache HTTP on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.041 second response time [13:31:51] RECOVERY - Nginx local proxy to apache on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.035 second response time [13:32:48] (03Merged) 10jenkins-bot: Query: add multi-query support [software/cumin] - 10https://gerrit.wikimedia.org/r/366737 (https://phabricator.wikimedia.org/T170394) (owner: 10Volans) [13:33:13] (03CR) 10Volans: [C: 032] Transports: improve Command class [software/cumin] - 10https://gerrit.wikimedia.org/r/367823 
(https://phabricator.wikimedia.org/T171679) (owner: 10Volans) [13:33:27] (03PS3) 10Filippo Giunchedi: monitor: fix nutcracker diamond collector double quote escaping [puppet] - 10https://gerrit.wikimedia.org/r/369628 (https://phabricator.wikimedia.org/T172254) [13:34:53] (03Merged) 10jenkins-bot: Transports: improve Command class [software/cumin] - 10https://gerrit.wikimedia.org/r/367823 (https://phabricator.wikimedia.org/T171679) (owner: 10Volans) [13:35:14] (03CR) 10Volans: [C: 032] CLI: add an option to ignore exit codes of commands [software/cumin] - 10https://gerrit.wikimedia.org/r/367824 (https://phabricator.wikimedia.org/T171679) (owner: 10Volans) [13:35:56] 10Operations: wikitech-static.wikimedia.org certificate renewal (expiring 2017-08-09) - https://phabricator.wikimedia.org/T172285#3493246 (10herron) Hey @Andrew it was mentioned on IRC that you're the maintainer of wikitech-static so pinging you here for awareness [13:36:57] (03Merged) 10jenkins-bot: CLI: add an option to ignore exit codes of commands [software/cumin] - 10https://gerrit.wikimedia.org/r/367824 (https://phabricator.wikimedia.org/T171679) (owner: 10Volans) [13:39:38] (03CR) 10Filippo Giunchedi: [C: 032] monitor: fix nutcracker diamond collector double quote escaping [puppet] - 10https://gerrit.wikimedia.org/r/369628 (https://phabricator.wikimedia.org/T172254) (owner: 10Filippo Giunchedi) [13:53:00] PROBLEM - nova-compute process on labvirt1011 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute [13:55:00] RECOVERY - nova-compute process on labvirt1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute [14:02:45] 10Operations, 10ops-eqiad, 10User-fgiunchedi: ms-be1016 controller cache failure - https://phabricator.wikimedia.org/T150206#3493388 (10Cmjohnson) @fgiunchedi The new raid battery is here...let me know when it's safe to turn off and replace. 
[14:10:16] (03CR) 10Ema: [C: 032] 4.1.8-1wm1: new upstream, fixes VSV00001 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/369643 (owner: 10Ema) [14:12:08] !log Stop MySQL on db2051 in order to get it ready to move to another rack - T170351 [14:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:20] T170351: db2019 has performance issues, replace disk or switchover s4 master elsewhere - https://phabricator.wikimedia.org/T170351 [14:14:04] (03PS1) 10Elukey: Add support for TLS [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/369658 (https://phabricator.wikimedia.org/T165736) [14:15:13] (03PS2) 10Elukey: Add support for TLS [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/369658 (https://phabricator.wikimedia.org/T165736) [14:16:24] (03PS5) 10Ottomata: Trigger a rerun of druid-hdfs-storage-cdh-link during a Druid upgrade [puppet] - 10https://gerrit.wikimedia.org/r/355469 (https://phabricator.wikimedia.org/T164008) [14:16:57] (03PS6) 10Ottomata: Trigger a rerun of druid-hdfs-storage-cdh-link during a Druid upgrade [puppet] - 10https://gerrit.wikimedia.org/r/355469 (https://phabricator.wikimedia.org/T164008) [14:18:18] (03CR) 10jerkins-bot: [V: 04-1] Trigger a rerun of druid-hdfs-storage-cdh-link during a Druid upgrade [puppet] - 10https://gerrit.wikimedia.org/r/355469 (https://phabricator.wikimedia.org/T164008) (owner: 10Ottomata) [14:18:41] (03PS1) 10Gehel: maps - tune postgresql for maps-test cluster [puppet] - 10https://gerrit.wikimedia.org/r/369660 (https://phabricator.wikimedia.org/T169011) [14:20:23] (03PS7) 10Ottomata: Trigger a rerun of druid-hdfs-storage-cdh-link during a Druid upgrade [puppet] - 10https://gerrit.wikimedia.org/r/355469 (https://phabricator.wikimedia.org/T164008) [14:23:14] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3493485 (10Cmjohnson) 
@chasemp Do you know the raid cfg you want? The server has (12) 3.5 6Tb disks and (2) 2.5" disk, the disk shelf has (12) 3.5"... [14:23:28] (03CR) 10Ottomata: [C: 032] Trigger a rerun of druid-hdfs-storage-cdh-link during a Druid upgrade [puppet] - 10https://gerrit.wikimedia.org/r/355469 (https://phabricator.wikimedia.org/T164008) (owner: 10Ottomata) [14:23:42] (03PS8) 10Ottomata: Trigger a rerun of druid-hdfs-storage-cdh-link during a Druid upgrade [puppet] - 10https://gerrit.wikimedia.org/r/355469 (https://phabricator.wikimedia.org/T164008) [14:23:54] (03CR) 10Ottomata: [V: 032 C: 032] Trigger a rerun of druid-hdfs-storage-cdh-link during a Druid upgrade [puppet] - 10https://gerrit.wikimedia.org/r/355469 (https://phabricator.wikimedia.org/T164008) (owner: 10Ottomata) [14:24:07] (03PS2) 10Cmjohnson: Removing old mgmt dns entries for decom'd hosts, added 1 mgmt entry for cp1008 [dns] - 10https://gerrit.wikimedia.org/r/368896 [14:24:45] (03CR) 10Cmjohnson: [C: 032] Removing old mgmt dns entries for decom'd hosts, added 1 mgmt entry for cp1008 [dns] - 10https://gerrit.wikimedia.org/r/368896 (owner: 10Cmjohnson) [14:25:05] (03PS2) 10Gehel: maps - tune postgresql for maps-test cluster [puppet] - 10https://gerrit.wikimedia.org/r/369660 (https://phabricator.wikimedia.org/T169011) [14:25:57] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3493504 (10Ottomata) Alright! Luca and I have tested some things, and discussed this migration a little more. We're going t... 
[14:28:51] (03PS3) 10Gehel: maps - tune postgresql for maps-test cluster [puppet] - 10https://gerrit.wikimedia.org/r/369660 (https://phabricator.wikimedia.org/T169011) [14:31:12] !log varnish 4.1.8-1wm1 (fixes VSV00001, DSA 3924-1) built and uploaded to apt.w.o [14:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:06] 10Operations, 10monitoring, 10Patch-For-Review: Double quotes in nutcracker config make json stats invalid json - https://phabricator.wikimedia.org/T172254#3493542 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi Resolving now as the nutcracker collector works again on scb, will reopen depending on what... [14:45:19] 10Operations, 10Beta-Cluster-Infrastructure, 10VPS-Projects, 10Release-Engineering-Team (Kanban), and 2 others: a lot of beta cluster instances are not reachable over SSH - https://phabricator.wikimedia.org/T171174#3493557 (10fgiunchedi) [14:46:15] !log upgrading cache_misc to varnish 4.1.8-1wm1 [14:46:15] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3493561 (10chasemp) >>! In T167984#3493485, @Cmjohnson wrote: > @chasemp Do you know the raid cfg you want? The server has (12) 3.5 6Tb disks and (... [14:46:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:47] 10Operations: ganeti2003 ipmi_sdr_cache_create: internal IPMI error - https://phabricator.wikimedia.org/T172115#3493564 (10herron) 05Open>03Resolved a:03herron Seems to have cleared on its own ganeti2003:~# /usr/local/lib/nagios/plugins/check_ipmi_sensor --noentityabsent -T Temperature -ST Temperature -... 
[14:47:49] Operations, monitoring, Patch-For-Review: Several hosts return "internal IPMI error" in the check_ipmi_temp check - https://phabricator.wikimedia.org/T167121#3493567 (herron) [14:47:55] Operations, ops-eqiad, Analytics, Analytics-Cluster, Patch-For-Review: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3493568 (elukey) Note about disk config: we are going for a 12 disk RAID10 partition plus a raid1/10 root one. I can work o... [14:48:11] PROBLEM - cassandra-a CQL 10.64.0.167:9042 on restbase-dev1004 is CRITICAL: connect to address 10.64.0.167 and port 9042: Connection refused [14:48:40] PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:48:41] PROBLEM - cassandra-a service on restbase-dev1004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [14:48:53] PROBLEM - cassandra-a SSL 10.64.0.167:7001 on restbase-dev1004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [14:53:41] RECOVERY - cassandra-a service on restbase-dev1004 is OK: OK - cassandra-a is active [14:54:40] RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational [14:56:00] RECOVERY - cassandra-a SSL 10.64.0.167:7001 on restbase-dev1004 is OK: SSL OK - Certificate restbase-dev1004-a valid until 2018-07-20 15:08:04 +0000 (expires in 352 days) [14:56:20] RECOVERY - cassandra-a CQL 10.64.0.167:9042 on restbase-dev1004 is OK: TCP OK - 0.000 second response time on 10.64.0.167 port 9042 [14:58:16] cmjohnson1: hey, did hp ship the battery for 1017 too ? 
anyways we can do 1016 today if you have time [14:58:28] marostegui: hello i am ready when you are [14:58:36] the new task is T171183 not T150206 btw [14:58:36] T150206: ms-be1016 controller cache failure - https://phabricator.wikimedia.org/T150206 [14:58:36] T171183: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T171183 [14:58:45] godog: two different tickets, one just shipped it to me and the other is making me jump through some hoops [14:58:50] papaul: Hello! If you have the new IP for it, I will change it before shutting the server down [14:58:53] So it boots up with the new one [14:59:13] godog: i hope to have the other tomorrow [14:59:25] and lmk when you want to swap..I can do anytime [14:59:51] marostegui: ok give me a minutte [14:59:57] sounds good [15:00:24] Operations, Mail: Get mail relay out of Yahoo! blacklist: apply to Yahoo for whitelisting bulk mail - https://phabricator.wikimedia.org/T58414#3493604 (herron) [15:01:33] (CR) Reedy: [C: 1] "Can we get this out? :)" [dns] - https://gerrit.wikimedia.org/r/358132 (https://phabricator.wikimedia.org/T133191) (owner: Herron) [15:02:25] cmjohnson1: ok! let's do 1016 now then, I'll shut it [15:02:34] ok [15:02:34] marostegui: 10.192.16.22 [15:02:50] papaul: cool - I will change it on the host and mediawiki files, you do dns? [15:03:04] marostegui: yes i will do DNS [15:05:49] cmjohnson1: 1016 should be off by now [15:06:10] ok [15:06:16] !log Poweroff db2051 to get it move to another rack - T169501 [15:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:29] T169501: Move some masters away from B6 - https://phabricator.wikimedia.org/T169501 [15:06:40] papaul: host going down with the IP changed [15:08:23] (CR) Herron: "You bet. 
Will merge this shortly" [dns] - https://gerrit.wikimedia.org/r/358132 (https://phabricator.wikimedia.org/T133191) (owner: Herron) [15:09:30] XioNoX: robh hello anyone available to do switch port configuration? [15:10:33] (PS1) Marostegui: db-eqiad,db-codfw.php: Change db2051 IP [mediawiki-config] - https://gerrit.wikimedia.org/r/369673 (https://phabricator.wikimedia.org/T169501) [15:11:14] godog: powering up [15:11:17] papaul: hey sure, what's up [15:11:25] (PS2) Marostegui: db-eqiad,db-codfw.php: Change db2051 IP [mediawiki-config] - https://gerrit.wikimedia.org/r/369673 (https://phabricator.wikimedia.org/T169501) [15:11:52] papaul: when you have a minute, can you confirm this is the IP: https://gerrit.wikimedia.org/r/#/c/369673/ ? [15:13:10] XioNoX: we are moving db2051 from row C to row B will like to setup the new swithc port [15:13:11] Operations, ops-codfw, DBA, Patch-For-Review: Move some masters away from B6 - https://phabricator.wikimedia.org/T169501#3493629 (Papaul) New switch port configuration for db2051 asw-b8-codfw ge-8/0/17 [15:13:17] XioNoX: https://phabricator.wikimedia.org/T169501 [15:14:03] !log nschaaf@tin Started deploy [recommendation-api/deploy@baa11c0]: source parameter validation [15:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:30] (CR) Papaul: [C: 2] db-eqiad,db-codfw.php: Change db2051 IP [mediawiki-config] - https://gerrit.wikimedia.org/r/369673 (https://phabricator.wikimedia.org/T169501) (owner: Marostegui) [15:16:10] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Change db2051 IP [mediawiki-config] - https://gerrit.wikimedia.org/r/369673 (https://phabricator.wikimedia.org/T169501) (owner: Marostegui) [15:16:25] (CR) jenkins-bot: db-eqiad,db-codfw.php: Change db2051 IP [mediawiki-config] - https://gerrit.wikimedia.org/r/369673 (https://phabricator.wikimedia.org/T169501) (owner: Marostegui) [15:17:25] !log nschaaf@tin Finished deploy 
[recommendation-api/deploy@baa11c0]: source parameter validation (duration: 03m 22s) [15:17:25] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Change db2051 IP - T169501 (duration: 00m 46s) [15:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:36] papaul: ge-8/0/17 is now in vlan-private1-b-codfw [15:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:44] T169501: Move some masters away from B6 - https://phabricator.wikimedia.org/T169501 [15:18:12] (PS1) Papaul: DNS: Change production DNS for db2051 [dns] - https://gerrit.wikimedia.org/r/369675 [15:18:19] XioNoX: thank you [15:18:21] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Change db2051 IP - T169501 (duration: 00m 46s) [15:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:59] (PS2) Ayounsi: Revert "Add syslog-udp for logstash testing on 11515" [puppet] - https://gerrit.wikimedia.org/r/369577 (owner: EBernhardson) [15:20:45] marostegui: can I move it now? [15:21:08] papaul: yep, all done from y side! [15:21:11] *my side [15:21:35] papaul: you want me to deploy dns? [15:21:43] marostegui: ok give me 15 minutes [15:21:55] papaul: sure! thank you! 
[15:22:07] (CR) Ayounsi: [C: 2] Revert "Add syslog-udp for logstash testing on 11515" [puppet] - https://gerrit.wikimedia.org/r/369577 (owner: EBernhardson) [15:25:36] (CR) Marostegui: [C: 1] DNS: Change production DNS for db2051 [dns] - https://gerrit.wikimedia.org/r/369675 (owner: Papaul) [15:25:52] papaul: if you want me to deploy, let me know, if not, I will let you do it :) [15:27:06] (PS1) Gehel: [WIP] wdqs - moving to role / profiles [puppet] - https://gerrit.wikimedia.org/r/369682 [15:28:12] (CR) jerkins-bot: [V: -1] [WIP] wdqs - moving to role / profiles [puppet] - https://gerrit.wikimedia.org/r/369682 (owner: Gehel) [15:28:30] PROBLEM - Host db2051.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:30:42] (PS2) Gehel: [WIP] wdqs - moving to role / profiles [puppet] - https://gerrit.wikimedia.org/r/369682 [15:30:51] (PS3) Elukey: Enable support for TLS provided by librdkafka [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/369658 (https://phabricator.wikimedia.org/T165736) [15:31:21] (CR) Dereckson: [C: 1] Initial configuration for hiwikiversity (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/368165 (https://phabricator.wikimedia.org/T168765) (owner: Urbanecm) [15:34:28] !log mobrovac@tin Started deploy [cxserver/deploy@cf3e280]: (no justification provided) [15:34:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:52] !log mobrovac@tin Finished deploy [cxserver/deploy@cf3e280]: (no justification provided) (duration: 00m 22s) [15:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:40] PROBLEM - cxserver endpoints health on scb2001 is CRITICAL: /v1/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) 
is CRITICAL: Could not fetch url http://10.192.32.132:8080/v1/translate/en/es/Apertium: Generic connection error: HTTPConnectionPool(host=u10.192.32.132, port=8080): Max retries exceeded with url: /v1/translate/en/es/Apertium (Caused by Pro [15:37:40] on aborted., BadStatusLine(,))) [15:37:50] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v1/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using Apertium, adapt the links to target language wiki. returned the unexpected status 404 (expecting: 200) [15:37:56] marostegui: server power on [15:38:11] \o/ [15:38:33] papaul: dns merged? [15:39:00] RECOVERY - Host db2051.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.20 ms [15:39:15] marostegui: you have to review and merget DNS for me [15:39:26] I reviewed it already, I will merge and deploy then [15:39:35] XioNoX: i am ready for you when you are [15:39:36] (CR) Marostegui: [C: 2] DNS: Change production DNS for db2051 [dns] - https://gerrit.wikimedia.org/r/369675 (owner: Papaul) [15:39:43] papaul: what for? [15:39:49] pfw [15:39:57] papaul: oh, right [15:40:21] robh: you next i got the disks [15:40:26] (CR) Ottomata: Enable support for TLS provided by librdkafka (1 comment) [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/369658 (https://phabricator.wikimedia.org/T165736) (owner: Elukey) [15:40:30] PROBLEM - puppet last run on mw1259 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:40:50] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy [15:40:54] ? [15:41:02] papaul: next for what? 
[15:41:09] (PS3) Gehel: [WIP] wdqs - moving to role / profiles [puppet] - https://gerrit.wikimedia.org/r/369682 [15:41:18] I haven't been following this channel closely this am, been working on other stuff ;D [15:41:52] papaul: go for it, let's focus on pfw3a:xe-0/0/17 <-> fasw:xe-0/2/0, first try to unplug/replug all the optics and fibers [15:41:53] robh: https://phabricator.wikimedia.org/T171584 [15:42:16] ohhhhh [15:42:20] robh: ot the disk [15:42:21] let me get setup then, cool [15:42:21] RECOVERY - puppet last run on mw1259 is OK: OK: Puppet is currently enabled, last run 33 minutes ago with 0 failures [15:42:45] XioNoX: ok doing it [15:42:55] Operations, ops-eqiad, User-fgiunchedi: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T171183#3493733 (fgiunchedi) Swapping by @Cmjohnson worked! ``` root@ms-be1016:~# /usr/local/lib/nagios/plugins/check_hpssacli OK: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1... [15:43:02] cmjohnson1: we're back on ms-be1016, thanks! [15:43:53] godog: standby..i think ms-be1017 may have just arrived...let you know in 5 mins [15:44:00] cmjohnson1: kk [15:45:24] XioNoX: unplug [15:45:53] XioNoX: i stay have green light on pfw3a [15:46:29] papaul: on the interface? [15:46:44] still shows as down on the devices [15:46:46] XioNoX: yes [15:46:56] so weird [15:48:20] PROBLEM - cassandra-b service on restbase-dev1004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [15:48:30] PROBLEM - cassandra-b CQL 10.64.0.168:9042 on restbase-dev1004 is CRITICAL: connect to address 10.64.0.168 and port 9042: Connection refused [15:48:40] PROBLEM - cassandra-b SSL 10.64.0.168:7001 on restbase-dev1004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:49:00] PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:50:06] Operations, ops-eqiad, User-fgiunchedi: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T171183#3493740 (fgiunchedi) [15:50:26] papaul: could you try new optics and a new fiber? [15:50:44] without running it properly for now, jsut to see if it comes up [15:51:00] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v1/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using Apertium, adapt the links to target language wiki. returned the unexpected status 404 (expecting: 200) [15:51:17] (CR) Hashar: [C: 1] "Thanks for the replies! I guess we can override the current Rakefile and CI will start running this new version." (4 comments) [puppet] - https://gerrit.wikimedia.org/r/366591 (https://phabricator.wikimedia.org/T166888) (owner: Giuseppe Lavagetto) [15:51:41] XioNoX: ok [15:54:00] RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational [15:54:20] RECOVERY - cassandra-b service on restbase-dev1004 is OK: OK - cassandra-b is active [15:55:17] XioNoX: done [15:55:31] RECOVERY - cassandra-b CQL 10.64.0.168:9042 on restbase-dev1004 is OK: TCP OK - 0.000 second response time on 10.64.0.168 port 9042 [15:55:40] RECOVERY - cassandra-b SSL 10.64.0.168:7001 on restbase-dev1004 is OK: SSL OK - Certificate restbase-dev1004-b valid until 2018-07-20 15:08:05 +0000 (expires in 351 days) [15:55:49] godog: the battery for be1017 showed up [15:55:49] papaul: yay! it's up [15:56:10] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy [15:56:23] XioNoX: ok i have to unplug again [15:56:26] papaul: can you slowly rollback to see which part is faulty? [15:56:52] reuse previous fiber, then previous optic on A side, then previous optics on Z side? 
[15:57:04] XioNoX: ok [15:57:35] (CR) Elukey: Enable support for TLS provided by librdkafka (1 comment) [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/369658 (https://phabricator.wikimedia.org/T165736) (owner: Elukey) [15:58:32] XioNoX: i am pluging pfw side now [15:58:47] XioNoX: done [15:59:09] only pfw side is plug in [15:59:33] Operations, ops-eqiad, Cloud-Services, Patch-For-Review: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3493754 (Cmjohnson) [16:00:06] cmjohnson1: nice, if you have time now I'll shut 1017 too? [16:00:16] yep..go for it [16:00:38] papaul: it's back down [16:03:02] XioNoX: ok i doing fasw side now and unpluging pfw side [16:03:35] Operations, Commons, media-storage, monitoring, User-fgiunchedi: Monitor [[Special:ListFiles]] for non 200 HTTP statuses in thumbnails - https://phabricator.wikimedia.org/T106937#3493766 (fgiunchedi) Chatted with @chasemp about this today, the easiest way forward seems to be setting up an emu... [16:03:36] XioNoX: done [16:03:38] Operations, ops-eqiad, Cloud-Services, Patch-For-Review: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3493767 (Cmjohnson) @robh @chasemp The servers are racked and all preliminary work done. I connected the disk shelf to the server but it's not bei... [16:04:22] papaul: still down [16:04:44] papaul: does that mean that both optics are dead? 
[16:06:44] (PS10) Ottomata: Kafka broker profile and roles for new 'jumbo' cluster and 'simple' cluster [puppet] - https://gerrit.wikimedia.org/r/356232 (https://phabricator.wikimedia.org/T166162) [16:07:45] (CR) jerkins-bot: [V: -1] Kafka broker profile and roles for new 'jumbo' cluster and 'simple' cluster [puppet] - https://gerrit.wikimedia.org/r/356232 (https://phabricator.wikimedia.org/T166162) (owner: Ottomata) [16:08:29] (CR) BBlack: [C: 2] add "issue globalsign.com" to CAA recs [dns] - https://gerrit.wikimedia.org/r/369575 (https://phabricator.wikimedia.org/T155806) (owner: BBlack) [16:08:34] (PS2) BBlack: add "issue globalsign.com" to CAA recs [dns] - https://gerrit.wikimedia.org/r/369575 (https://phabricator.wikimedia.org/T155806) [16:09:43] cmjohnson1: should be powered off shortly [16:10:12] (PS11) Ottomata: Add new kafka broker profile, and add ssl support settings to confluent::kafka::broker. [puppet] - https://gerrit.wikimedia.org/r/356232 (https://phabricator.wikimedia.org/T166162) [16:11:03] (CR) jerkins-bot: [V: -1] Add new kafka broker profile, and add ssl support settings to confluent::kafka::broker. [puppet] - https://gerrit.wikimedia.org/r/356232 (https://phabricator.wikimedia.org/T166162) (owner: Ottomata) [16:11:26] (PS12) Ottomata: Add new kafka broker profile, and add ssl support settings to confluent::kafka::broker. 
[puppet] - https://gerrit.wikimedia.org/r/356232 (https://phabricator.wikimedia.org/T166162) [16:12:08] (PS13) Ottomata: Add new kafka broker profile, add confluent ssl settings [puppet] - https://gerrit.wikimedia.org/r/356232 (https://phabricator.wikimedia.org/T166162) [16:12:13] (CR) jerkins-bot: [V: -1] Add new kafka broker profile, add confluent ssl settings [puppet] - https://gerrit.wikimedia.org/r/356232 (https://phabricator.wikimedia.org/T166162) (owner: Ottomata) [16:13:31] Operations, ops-eqiad, Cloud-Services, Patch-For-Review: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3493786 (Cmjohnson) They are connected like this on the P441 port 1E is connected DP1 (I/O module A) of the array and port 2E goes to DP1 (I/O... [16:14:12] papaul: it's back up, did you change something? [16:14:14] XioNoX: i just plug in the new fiber with a connector [16:14:37] papaul: so the fiber is faulty? [16:16:43] (PS1) Gehel: wdqs - explicit scope for systemd templates to ensure puppet 4 compatibility [puppet] - https://gerrit.wikimedia.org/r/369685 [16:17:01] XioNoX: what i can say when using the DAC juniper 1.0m to connect pfw to fasw it doesn't work but when using a fiber with a connector it works [16:17:15] I see [16:17:26] papaul: then please do the same for the other link [16:17:58] papaul: also let me know when/if I can disable/cleanup the previous port of db2051 [16:20:09] godog: powering up [16:20:31] marostegui: are we ok with d 2051 new location so XioNoX can clean up the previuos port ? 
[16:20:44] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [16:21:48] seems a peak of 502s [16:22:10] (PS14) Ottomata: Add new kafka broker profile, add confluent ssl settings [puppet] - https://gerrit.wikimedia.org/r/356232 (https://phabricator.wikimedia.org/T166162) [16:22:38] (CR) Krinkle: "https://puppet-compiler.wmflabs.org/compiler02/7213/ Passed." [puppet] - https://gerrit.wikimedia.org/r/355338 (https://phabricator.wikimedia.org/T113114) (owner: Krinkle) [16:23:05] solved, last datapoints are ok [16:23:17] (CR) jerkins-bot: [V: -1] Add new kafka broker profile, add confluent ssl settings [puppet] - https://gerrit.wikimedia.org/r/356232 (https://phabricator.wikimedia.org/T166162) (owner: Ottomata) [16:23:49] Operations, ops-eqiad, Analytics, Analytics-Cluster, Patch-For-Review: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3493822 (RobH) @elukey: I'm happy to help with partman, but I want to confirm: These systems have dual 1TB OS disks, which... 
[16:24:04] XioNoX: give me a minute have to change cable ID [16:24:30] Operations, ops-eqiad, Cloud-Services, Patch-For-Review: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3493823 (chasemp) ping @madhuvishy hopefully will have some time to read up on the manuals :) [16:24:44] (PS13) Krinkle: varnish: Switch browsersec to use errorpage template [puppet] - https://gerrit.wikimedia.org/r/355338 (https://phabricator.wikimedia.org/T113114) [16:24:51] (CR) Krinkle: "Minor edits to reduce diff" [puppet] - https://gerrit.wikimedia.org/r/355338 (https://phabricator.wikimedia.org/T113114) (owner: Krinkle) [16:25:38] (PS15) Ottomata: Add new kafka broker profile, add confluent ssl settings [puppet] - https://gerrit.wikimedia.org/r/356232 (https://phabricator.wikimedia.org/T166162) [16:25:50] no rush, thx [16:26:04] robh: just checking the one you order is it SAS compatible? [16:26:14] ? [16:26:17] the usb disk thing? [16:26:20] yes [16:26:24] it wont let SAS run at full speed [16:26:26] but it'll work [16:26:30] ok [16:26:30] Operations, ops-eqiad, Analytics, Analytics-Cluster, Patch-For-Review: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3493832 (Ottomata) +1 @RobH [16:26:55] (CR) Krinkle: "Passed - https://puppet-compiler.wmflabs.org/compiler02/7276/" [puppet] - https://gerrit.wikimedia.org/r/355338 (https://phabricator.wikimedia.org/T113114) (owner: Krinkle) [16:26:59] (CR) Ottomata: "Adds listeners as expected" [puppet] - https://gerrit.wikimedia.org/r/356232 (https://phabricator.wikimedia.org/T166162) (owner: Ottomata) [16:27:06] SAS is the same connector/controller just a higher speed disk and transfer than traditional sata [16:27:22] so if it is plugged into a slower sata controller, you just lose the speed benefits [16:27:44] Operations, ops-eqiad, Analytics, Analytics-Cluster, Patch-For-Review: 
rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3493841 (elukey) It seems good to me, for some reason I was under the impression that we preferred sw raid vs hw controlled... [16:27:52] SSD (until the recent shortage) was quickly replacing SAS use for us throughout our deployments. [16:28:19] Operations, ops-eqiad: Degraded RAID on ms-be1017 - https://phabricator.wikimedia.org/T171926#3493847 (fgiunchedi) looks like we're back, thanks @Cmjohnson ! ``` root@ms-be1017:~# /usr/local/lib/nagios/plugins/check_hpssacli OK: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4... [16:29:10] Operations, ops-eqiad: Degraded RAID on ms-be1017 - https://phabricator.wikimedia.org/T171926#3493848 (fgiunchedi) [16:29:12] cmjohnson1: all good \o/ thanks [16:30:31] Operations, ops-eqiad, Analytics, Analytics-Cluster, Patch-For-Review: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3493852 (Ottomata) I'm not sure of a reason to prefer software raid other than ease of management. Likely performance is b... [16:31:53] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:33:33] RECOVERY - Router interfaces on pfw3-codfw is OK: OK: host 208.80.153.197, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0 [16:33:34] XioNoX: check pfwb [16:33:48] papaul: all up, thanks! [16:34:05] XioNoX: ok [16:34:19] XioNoX: anything else? [16:34:31] Operations, ops-eqiad, User-fgiunchedi: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T171183#3493878 (Cmjohnson) Open>Resolved Replaced battery, the old part is not required to be returned. [16:34:40] Operations, ops-eqiad: Degraded RAID on ms-be1017 - https://phabricator.wikimedia.org/T171926#3493880 (Cmjohnson) Open>Resolved Replaced battery, the old part is not required to be returned. 
[16:35:56] all good for frack, thanks! [16:35:59] (PS16) Ottomata: Add new kafka broker profile, add confluent ssl and rack settings [puppet] - https://gerrit.wikimedia.org/r/356232 (https://phabricator.wikimedia.org/T166162) [16:37:04] (CR) jerkins-bot: [V: -1] Add new kafka broker profile, add confluent ssl and rack settings [puppet] - https://gerrit.wikimedia.org/r/356232 (https://phabricator.wikimedia.org/T166162) (owner: Ottomata) [16:39:33] Operations, ops-codfw: failing RAID disk on frdb2001 - https://phabricator.wikimedia.org/T171584#3493889 (Papaul) Disk replacement complete. [16:39:40] (PS1) Gehel: wdqs - remove upstart configuration files [puppet] - https://gerrit.wikimedia.org/r/369688 [16:40:26] (PS17) Ottomata: Add new kafka broker profile, add confluent ssl and rack settings [puppet] - https://gerrit.wikimedia.org/r/356232 (https://phabricator.wikimedia.org/T166162) [16:40:29] (PS2) Gehel: wdqs - remove upstart configuration files [puppet] - https://gerrit.wikimedia.org/r/369688 [16:40:31] (PS2) Gehel: wdqs - explicit scope for systemd templates to ensure puppet 4 compatibility [puppet] - https://gerrit.wikimedia.org/r/369685 [16:42:12] (CR) Ottomata: "More recent: https://puppet-compiler.wmflabs.org/compiler02/7278/" [puppet] - https://gerrit.wikimedia.org/r/356232 (https://phabricator.wikimedia.org/T166162) (owner: Ottomata) [16:42:31] XioNoX: it looks likd db2051 is up so you can cleanup the old switch configuration thanks [16:43:31] !log depooling and shutting down elastic102[4567] - T168816 [16:43:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:42] T168816: some elasticsearch servers in eqiad have CPU overheating - https://phabricator.wikimedia.org/T168816 [16:44:24] great [16:44:34] (CR) Herron: [C: 2] Change wikimedia.org SPF record to soft fail (~all) [dns] - https://gerrit.wikimedia.org/r/358132 (https://phabricator.wikimedia.org/T133191) 
(owner: Herron) [16:44:42] (PS3) Herron: Change wikimedia.org SPF record to soft fail (~all) [dns] - https://gerrit.wikimedia.org/r/358132 (https://phabricator.wikimedia.org/T133191) [16:45:38] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=elastic10(24|25|26|27).eqiad.wmnet [16:45:46] (PS1) Elukey: Add instruction about how to enable TLS/SSL support [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/369689 (https://phabricator.wikimedia.org/T165736) [16:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:49] (Abandoned) Elukey: Enable support for TLS provided by librdkafka [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/369658 (https://phabricator.wikimedia.org/T165736) (owner: Elukey) [16:47:29] cmjohnson1: elastic102[4567] are down and ready for your magic [16:50:01] Operations, ops-eqiad, Analytics, Analytics-Cluster, Patch-For-Review: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3493916 (RobH) @elukey: So we do prefer sw raid over hw raid when purchasing servers. However, servers in this particular... [16:50:09] Operations, Analytics, Traffic, Patch-For-Review, User-Elukey: Update Varnishkafka to support TLS encryption/authentication - https://phabricator.wikimedia.org/T165736#3493917 (elukey) After a chat with @Ottomata we realized that applying the correct namespace (`'kafka.'` prefix to all the li... [16:50:17] (CR) Aklapper: "All problematic uploads in the last 48 hours come from IPs covered by this patch. 
I've given up deleting any such uploads until this patch" [puppet] - https://gerrit.wikimedia.org/r/368775 (owner: Aklapper) [16:50:25] (PS3) Aklapper: phabricator: Block certain mobile IP ranges from uploading files [puppet] - https://gerrit.wikimedia.org/r/368775 [16:50:27] Operations, Analytics-Kanban, Traffic, User-Elukey: Update Varnishkafka to support TLS encryption/authentication - https://phabricator.wikimedia.org/T165736#3493918 (elukey) [16:51:54] (CR) Ottomata: [C: 1] Add instruction about how to enable TLS/SSL support [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/369689 (https://phabricator.wikimedia.org/T165736) (owner: Elukey) [16:52:31] (CR) Elukey: [V: 2 C: 2] Add instruction about how to enable TLS/SSL support [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/369689 (https://phabricator.wikimedia.org/T165736) (owner: Elukey) [16:57:33] (PS1) Elukey: Improve visibility of the TLS section in varnishkafka.conf.example [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/369694 (https://phabricator.wikimedia.org/T165736) [16:57:52] (CR) Elukey: [V: 2 C: 2] Improve visibility of the TLS section in varnishkafka.conf.example [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/369694 (https://phabricator.wikimedia.org/T165736) (owner: Elukey) [17:00:22] (PS1) Gehel: redis - instance names should be strings in puppet 4 [puppet] - https://gerrit.wikimedia.org/r/369695 [17:01:55] Operations, Mail, Security: Make SPF for wikimedia.org more strict - https://phabricator.wikimedia.org/T133191#3493976 (Reedy) [17:02:03] Operations, Mail, Security: Make SPF for wikimedia.org more strict - https://phabricator.wikimedia.org/T133191#2224664 (Reedy) Open>Resolved [17:14:10] (PS1) Ayounsi: Define network infra ranges and allow them to send syslog to logstash [puppet] - https://gerrit.wikimedia.org/r/369697 
(https://phabricator.wikimedia.org/T166126) [17:18:50] !log installed libmail-spf-perl and restarted spamassassin service on mx[1,2]001 - T172299 [17:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:02] T172299: Install missing spamassassin SPF dependency on mx cluster - https://phabricator.wikimedia.org/T172299 [17:22:33] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v1/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using Apertium, adapt the links to target language wiki. returned the unexpected status 404 (expecting: 200) [17:24:57] Operations, netops, Patch-For-Review: deploy diffscan2 - https://phabricator.wikimedia.org/T169624#3494084 (ayounsi) Open>Resolved [17:25:20] !log mobrovac@tin Started deploy [recommendation-api/deploy@8dfae34]: (no justification provided) [17:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:57] !log mobrovac@tin Finished deploy [recommendation-api/deploy@8dfae34]: (no justification provided) (duration: 01m 36s) [17:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:34] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy [17:41:16] (PS6) Paladox: Gerrit: Enable manage_home in scap [puppet] - https://gerrit.wikimedia.org/r/369560 [17:42:19] (CR) EBernhardson: [C: 1] "seems a sane approach" [puppet] - https://gerrit.wikimedia.org/r/369697 (https://phabricator.wikimedia.org/T166126) (owner: Ayounsi) [17:43:45] !log mobrovac@tin Started deploy [cxserver/deploy@f43ef96]: Revert canary scb2001 to f43ef963 [17:43:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:59] (PS18) Ottomata: Modify confluent module to support listeners, ssl, and rack [puppet] - 
https://gerrit.wikimedia.org/r/356232 (https://phabricator.wikimedia.org/T166162) [17:44:01] (PS1) Ottomata: Add kafka broker profiles for new jumbo and simple clusters [puppet] - https://gerrit.wikimedia.org/r/369700 (https://phabricator.wikimedia.org/T166162) [17:44:14] !log mobrovac@tin Finished deploy [cxserver/deploy@f43ef96]: Revert canary scb2001 to f43ef963 (duration: 00m 29s) [17:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:36] !log mobrovac@tin Started deploy [cxserver/deploy@f43ef96]: Revert canary scb2001 to f43ef963, take #2 [17:44:44] RECOVERY - cxserver endpoints health on scb2001 is OK: All endpoints are healthy [17:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:52] !log mobrovac@tin Finished deploy [cxserver/deploy@f43ef96]: Revert canary scb2001 to f43ef963, take #2 (duration: 00m 16s) [17:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:51] (PS1) MaxSem: Enable HTML5 sections in betalabs [mediawiki-config] - https://gerrit.wikimedia.org/r/369701 [17:47:25] (Abandoned) Ottomata: Puppetize TLS encryption and auth for Kafka in confluent module [puppet] - https://gerrit.wikimedia.org/r/355796 (https://phabricator.wikimedia.org/T166162) (owner: Ottomata) [17:49:12] (CR) 20after4: [C: 1] phabricator: Block certain mobile IP ranges from uploading files [puppet] - https://gerrit.wikimedia.org/r/368775 (owner: Aklapper) [17:51:18] (PS19) Ottomata: Modify confluent module to support listeners, ssl, and rack [puppet] - https://gerrit.wikimedia.org/r/356232 (https://phabricator.wikimedia.org/T166162) [17:54:06] (CR) EBernhardson: [C: 1] "Code that uses this is merged, will ship in wmf.13" [mediawiki-config] - https://gerrit.wikimedia.org/r/368776 (https://phabricator.wikimedia.org/T171803) (owner: DCausse) [17:54:10] (PS20) Ottomata: Modify confluent module to support listeners, ssl, 
and rack [puppet] - 10https://gerrit.wikimedia.org/r/356232 (https://phabricator.wikimedia.org/T166162) [17:54:23] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/7281/" [puppet] - 10https://gerrit.wikimedia.org/r/356232 (https://phabricator.wikimedia.org/T166162) (owner: 10Ottomata) [17:56:15] !log restart kafka1012 broker with listeners=PLAINTEXT://:9092 to verify https://gerrit.wikimedia.org/r/#/c/356232/ before merge. This should be a functional no-op [17:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170802T1800). Please do the needful. [18:00:04] Pchelolo, RoanKattouw, and Smalyshev: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:15] I'm here [18:00:34] here [18:00:45] I'm here but will be doing CREDIT presentation, so may be unavailable for ~5 mins [18:01:28] (03CR) 10Ottomata: [C: 032] "Verified to be a funcitonal no-op on kafka1012." [puppet] - 10https://gerrit.wikimedia.org/r/356232 (https://phabricator.wikimedia.org/T166162) (owner: 10Ottomata) [18:01:41] I can SWAT [18:03:03] (03PS2) 10Thcipriani: JobQueueEventBus: Enable job events in group0 wikis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/368258 (https://phabricator.wikimedia.org/T163380) (owner: 10Ppchelko) [18:03:09] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368258 (https://phabricator.wikimedia.org/T163380) (owner: 10Ppchelko) [18:04:19] (03PS1) 10Ottomata: Move listeners up in config file - no op [puppet] - 10https://gerrit.wikimedia.org/r/369704 (https://phabricator.wikimedia.org/T166162) [18:04:46] (03Merged) 10jenkins-bot: JobQueueEventBus: Enable job events in group0 wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368258 (https://phabricator.wikimedia.org/T163380) (owner: 10Ppchelko) [18:06:11] (03CR) 10Ottomata: [C: 032] Move listeners up in config file - no op [puppet] - 10https://gerrit.wikimedia.org/r/369704 (https://phabricator.wikimedia.org/T166162) (owner: 10Ottomata) [18:07:05] ok, done with CREDIT, back here [18:07:15] Pchelolo: your change is live on mwdebug1002, check please [18:07:25] Checking thcipriani [18:08:53] thcipriani: All looks good now [18:08:59] (03PS2) 10Thcipriani: Enable HTML5 sections in betalabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369701 (owner: 10MaxSem) [18:09:01] (03CR) 10jenkins-bot: JobQueueEventBus: Enable job events in group0 wikis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/368258 (https://phabricator.wikimedia.org/T163380) (owner: 10Ppchelko) [18:09:12] Pchelolo: ok, going live [18:11:58] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:368258|JobQueueEventBus: Enable job events in group0 wikis]] T163380 Part I (duration: 00m 47s) [18:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:09] T163380: Support posting Jobs to EventBus simultaneously with normal job processing - https://phabricator.wikimedia.org/T163380 [18:13:05] !log thcipriani@tin Synchronized wmf-config/jobqueue.php: SWAT: [[gerrit:368258|JobQueueEventBus: Enable job events in group0 wikis]] T163380 Part II (duration: 00m 47s) [18:13:11] (03PS2) 10Ottomata: Add kafka broker profiles for new jumbo and simple clusters [puppet] - 10https://gerrit.wikimedia.org/r/369700 (https://phabricator.wikimedia.org/T166162) [18:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:15] ^ Pchelolo live now [18:13:17] gehel: elastic102[4567] are finished [18:13:29] \o/ getting close :) [18:14:13] (03CR) 10jerkins-bot: [V: 04-1] Add kafka broker profiles for new jumbo and simple clusters [puppet] - 10https://gerrit.wikimedia.org/r/369700 (https://phabricator.wikimedia.org/T166162) (owner: 10Ottomata) [18:14:14] cmjohnson1: thanks! I'll check and repool them. [18:14:23] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369701 (owner: 10MaxSem) [18:14:38] cmjohnson1: how long are you still around? Does it make sense to start draining the next batch? 
[18:15:12] I plan to leave in 2 or so more hours [18:15:32] cmjohnson1: ok, so most probably too short [18:15:39] SMalyshev: your change is live on mwdebug1002, check please [18:15:45] (03Merged) 10jenkins-bot: Enable HTML5 sections in betalabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369701 (owner: 10MaxSem) [18:15:46] checking [18:16:29] thcipriani: yay, working! [18:16:35] SMalyshev: :) going live [18:16:56] (03CR) 10jenkins-bot: Enable HTML5 sections in betalabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369701 (owner: 10MaxSem) [18:18:05] !log un-banning and repooling elastic102[4567] - T168816 [18:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:17] T168816: some elasticsearch servers in eqiad have CPU overheating - https://phabricator.wikimedia.org/T168816 [18:18:34] !log thcipriani@tin Synchronized php-1.30.0-wmf.12/includes/specials/SpecialUndelete.php: SWAT: [[gerrit:369696|Fix Special:Undelete search - use variable and not request param]] (duration: 00m 46s) [18:18:42] ^ SMalyshev live now [18:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:44] (03PS3) 10Ottomata: Add kafka broker profiles for new jumbo and simple clusters [puppet] - 10https://gerrit.wikimedia.org/r/369700 (https://phabricator.wikimedia.org/T166162) [18:18:46] (03CR) 10Dzahn: [C: 032] Gerrit: Enable manage_home in scap [puppet] - 10https://gerrit.wikimedia.org/r/369560 (owner: 10Paladox) [18:18:53] thcipriani: thanks! 
[18:18:54] (03PS7) 10Dzahn: Gerrit: Enable manage_home in scap [puppet] - 10https://gerrit.wikimedia.org/r/369560 (owner: 10Paladox) [18:19:02] thanks mutante :) [18:19:27] (03PS4) 10Ottomata: Add kafka broker profiles for new jumbo and simple clusters [puppet] - 10https://gerrit.wikimedia.org/r/369700 (https://phabricator.wikimedia.org/T166162) [18:20:36] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic10(24|25|26|27).eqiad.wmnet [18:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:49] Thank you [18:21:09] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: beta-only change [[gerrit:369701|Enable HTML5 sections in betalabs]] (duration: 00m 46s) [18:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:27] ^ MaxSem that should go out to beta cluster with the next beta-scap-eqiad run [18:21:43] thcipriani, confirm it's deployed and works [18:21:48] nice :) [18:22:02] RoanKattouw: ping for SWAT [18:23:02] Oh hey, sorry [18:23:06] thcipriani: Here now [18:23:15] Sorry for spacing out [18:23:17] np :) [18:25:33] RainbowSprinkles ^^ ssh for gerrit2 should now work with scap hopefully :) [18:26:46] 10Operations, 10Commons, 10media-storage, 10monitoring, 10User-fgiunchedi: Monitor [[Special:ListFiles]] for non 200 HTTP statuses in thumbnails - https://phabricator.wikimedia.org/T106937#3494401 (10chasemp) @fgiunchedi ack, we don't have a ton of checks running w/ concurrency but @15m intervals seems s... 
[18:28:17] Notice: /Stage[main]/Gerrit::Jetty/Ssh::Userkey[gerrit2-scap]/File[/etc/ssh/userkeys/gerrit2.d/]/ensure: created [18:28:20] Notice: /Stage[main]/Gerrit::Jetty/Ssh::Userkey[gerrit2-scap]/File[/etc/ssh/userkeys/gerrit2.d/gerrit-scap]/ensure: created [18:28:23] paladox: RainbowSprinkles ^ [18:28:29] that created the key now [18:28:29] thanks :) [18:29:16] * paladox wonders will scap deploy work now :) [18:30:37] (03PS1) 10Ladsgroup: Add copyright info for Wikidata API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369707 (https://phabricator.wikimedia.org/T112606) [18:35:20] (03CR) 10Pnorman: [C: 031] maps - tune postgresql for maps-test cluster [puppet] - 10https://gerrit.wikimedia.org/r/369660 (https://phabricator.wikimedia.org/T169011) (owner: 10Gehel) [18:38:20] (03CR) 10Ottomata: [C: 032] "This is not applied anywhere, but I want to use this in labs to test some things. No op in prod." [puppet] - 10https://gerrit.wikimedia.org/r/369700 (https://phabricator.wikimedia.org/T166162) (owner: 10Ottomata) [18:38:27] (03PS5) 10Ottomata: Add kafka broker profiles for new jumbo and simple clusters [puppet] - 10https://gerrit.wikimedia.org/r/369700 (https://phabricator.wikimedia.org/T166162) [18:39:06] RoanKattouw: change is live on mwdebug1002, check please [18:39:34] thcipriani: It's a no-op under the current config so I can't verify in prod [18:39:44] It will become clear whether it works when I turn off the $wg var tomorrow [18:39:52] okie doke, going live [18:41:59] !log thcipriani@tin Synchronized php-1.30.0-wmf.12/includes/specials/SpecialRecentchanges.php: SWAT: [[gerrit:369574|Follow-up 31be7d0: send tags list if experimental mode is disabled]] (duration: 00m 47s) [18:42:05] ^ RoanKattouw live now [18:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:06] (03CR) 10Herron: [C: 032] phabricator: Block certain mobile IP ranges from uploading files [puppet] - 10https://gerrit.wikimedia.org/r/368775 (owner: 
10Aklapper) [18:45:11] mutante: paladox RainbowSprinkles I've been wondering about the error we saw yesterday when trying to ssh, "too many authentication failures", and what happens when keyholder has more keys in it than MaxAuthTries allows on sshd on the target server. I don't know, honestly. I guess we'll find out :) [18:45:27] :) [18:46:19] thcipriani: we'll find out yea, but until just now there was just no public key on the gerrit server side [18:46:33] right [18:46:43] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 0 [18:47:20] (03PS4) 10Herron: phabricator: Block certain mobile IP ranges from uploading files [puppet] - 10https://gerrit.wikimedia.org/r/368775 (owner: 10Aklapper) [18:48:45] just throwing out a prophecy of doom, as you do :) [19:00:04] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170802T1900). Please do the needful. [19:01:02] (03CR) 1020after4: "looks good, the change is applied on iridium and phab1001" [puppet] - 10https://gerrit.wikimedia.org/r/368775 (owner: 10Aklapper) [19:02:45] (03PS1) 10Ayounsi: Icinga: add check_bfd check (part 1) [puppet] - 10https://gerrit.wikimedia.org/r/369710 (https://phabricator.wikimedia.org/T83992) [19:03:54] thcipriani: hah, i checked.. MaxAuthTries: 7 Keys: 8 :o [19:03:57] (03PS2) 10Ayounsi: Icinga: add check_bfd check (part 1) [puppet] - 10https://gerrit.wikimedia.org/r/369710 (https://phabricator.wikimedia.org/T83992) [19:04:02] we'll see later when we try again [19:04:23] oh good :) [19:17:35] (03PS1) 10Ottomata: Add (unused) ssl params in confluent::kafka::broker [puppet] - 10https://gerrit.wikimedia.org/r/369713 (https://phabricator.wikimedia.org/T166162) [19:21:15] twentyafterfour thcipriani: Is the train going ahead today? 
[19:21:25] I ask because jouncebot announced it 20 minutes ago and nothing appears to have happened since [19:21:29] RoanKattouw: yes I just submitted the patch [19:21:51] Oh OK [19:22:09] Maybe the bot is being weird then [19:22:29] not sure why I don't see anything from wikibugs... https://gerrit.wikimedia.org/r/#/c/369714/ [19:22:34] (03PS1) 1020after4: group1 wikis to 1.30.0-wmf.12 refs refs T168053 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369714 [19:22:36] (03CR) 1020after4: [C: 032] group1 wikis to 1.30.0-wmf.12 refs refs T168053 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369714 (owner: 1020after4) [19:22:38] oh there it is [19:23:10] (03Merged) 10jenkins-bot: group1 wikis to 1.30.0-wmf.12 refs refs T168053 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369714 (owner: 1020after4) [19:23:20] (03CR) 10jenkins-bot: group1 wikis to 1.30.0-wmf.12 refs refs T168053 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369714 (owner: 1020after4) [19:23:38] (03CR) 10Ottomata: [C: 032] Add (unused) ssl params in confluent::kafka::broker [puppet] - 10https://gerrit.wikimedia.org/r/369713 (https://phabricator.wikimedia.org/T166162) (owner: 10Ottomata) [19:23:44] (03CR) 10Ottomata: [C: 032] "No op https://puppet-compiler.wmflabs.org/compiler02/7283/" [puppet] - 10https://gerrit.wikimedia.org/r/369713 (https://phabricator.wikimedia.org/T166162) (owner: 10Ottomata) [19:23:48] (03PS2) 10Ottomata: Add (unused) ssl params in confluent::kafka::broker [puppet] - 10https://gerrit.wikimedia.org/r/369713 (https://phabricator.wikimedia.org/T166162) [19:23:50] (03CR) 10Ottomata: [V: 032 C: 032] Add (unused) ssl params in confluent::kafka::broker [puppet] - 10https://gerrit.wikimedia.org/r/369713 (https://phabricator.wikimedia.org/T166162) (owner: 10Ottomata) [19:23:58] !log group1 wikis to 1.30.0-wmf.12 refs T168053 [19:24:08] twentyafterfour i think wikibugs is just being slow today [19:24:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:11] T168053: 1.30.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T168053 [19:24:23] ok syncing [19:24:46] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.30.0-wmf.12 refs refs T168053 [19:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:13] (03PS1) 10Cmjohnson: Adding dns entries for wdqs100[45], druid100[45],labmon1002 [dns] - 10https://gerrit.wikimedia.org/r/369717 [19:25:46] oh new errors: [19:25:47] [079eca5ca8a7c35d839b41ef] /rpc/RunJobs.php?wiki=cawiki&type=wikibase-InjectRCRecords&maxtime=60&maxmem=300M Wikimedia\Assert\ParameterAssertionException from line 63 of /srv/mediawiki/php-1.30.0-wmf.12/vendor/wikimedia/assert/src/Assert.php: Bad value [19:26:28] gj wikidata [19:27:19] /srv/mediawiki/php-1.30.0-wmf.12/extensions/Wikidata/extensions/Wikibase/client/includes/Changes/InjectRCRecordsJob.php [19:29:14] (03PS1) 10Ottomata: Use 0.11 version in kafka broker profile [puppet] - 10https://gerrit.wikimedia.org/r/369719 (https://phabricator.wikimedia.org/T166162) [19:29:32] aude: ^ [19:29:37] (03PS2) 10Ottomata: Use 0.11 version in kafka broker profile [puppet] - 10https://gerrit.wikimedia.org/r/369719 (https://phabricator.wikimedia.org/T166162) [19:31:10] (03PS1) 10Kaldari: Adding RTL database list for project with default RTL languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369720 (https://phabricator.wikimedia.org/T172305) [19:32:16] (03CR) 10Ottomata: [C: 032] Use 0.11 version in kafka broker profile [puppet] - 10https://gerrit.wikimedia.org/r/369719 (https://phabricator.wikimedia.org/T166162) (owner: 10Ottomata) [19:32:46] (03CR) 10jerkins-bot: [V: 04-1] Adding RTL database list for project with default RTL languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369720 (https://phabricator.wikimedia.org/T172305) (owner: 10Kaldari) [19:35:17] (03CR) 10RobH: [C: 
031] "looks good, but seems to be missing labmon production dns entries" [dns] - 10https://gerrit.wikimedia.org/r/369717 (owner: 10Cmjohnson) [19:39:06] (03CR) 10Cmjohnson: [C: 032] Adding dns entries for wdqs100[45], druid100[45],labmon1002 [dns] - 10https://gerrit.wikimedia.org/r/369717 (owner: 10Cmjohnson) [19:39:29] (03PS2) 10Kaldari: Adding RTL database list for project with default RTL languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369720 (https://phabricator.wikimedia.org/T172305) [19:40:37] 10Operations, 10ops-eqiad, 10Analytics-Cluster, 10Analytics-Kanban: rack/setup/install druid100[456].eqiad.wmnet - https://phabricator.wikimedia.org/T171626#3494736 (10Cmjohnson) [19:40:57] (03CR) 10jerkins-bot: [V: 04-1] Adding RTL database list for project with default RTL languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369720 (https://phabricator.wikimedia.org/T172305) (owner: 10Kaldari) [19:41:09] 10Operations, 10ops-eqiad, 10Discovery, 10Wikidata, and 2 others: rack/setup/install wdqs100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T171210#3494743 (10Cmjohnson) [19:41:46] 10Operations, 10ops-eqiad, 10Cloud-Services: rack/setup/install labmon1002 - https://phabricator.wikimedia.org/T165784#3276770 (10Cmjohnson) [19:44:07] ok I'm going to roll back, the errors are numerous enough to warrant it I think [19:44:14] error rate more than doubled [19:45:15] (03PS1) 1020after4: group1 wikis to 1.30.0-wmf.11 refs T168053 - rollback due to T172320 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369722 [19:45:17] (03CR) 1020after4: [C: 032] group1 wikis to 1.30.0-wmf.11 refs T168053 - rollback due to T172320 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369722 (owner: 1020after4) [19:46:45] (03Merged) 10jenkins-bot: group1 wikis to 1.30.0-wmf.11 refs T168053 - rollback due to T172320 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369722 (owner: 1020after4) [19:48:03] PROBLEM - graphite.wikimedia.org on graphite1001 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:50:05] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.30.0-wmf.11 refs T168053 - rollback due to T172320 [19:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:17] T172320: Error in Wikibase/client/includes/Changes/InjectRCRecordsJob.php line 120: Bad value for parameter $params: $params['change'] not set. - https://phabricator.wikimedia.org/T172320 [19:50:18] T168053: 1.30.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T168053 [19:51:53] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1547 bytes in 0.008 second response time [19:52:39] Train is blocked by https://phabricator.wikimedia.org/T172320 [19:53:07] (03PS3) 10Kaldari: Adding RTL database list for project with default RTL languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369720 (https://phabricator.wikimedia.org/T172305) [19:55:26] (03CR) 10jenkins-bot: group1 wikis to 1.30.0-wmf.11 refs T168053 - rollback due to T172320 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369722 (owner: 1020after4) [19:59:36] 10Operations, 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Performance-Team, and 6 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3494814 (10Krinkle) [19:59:47] 10Operations, 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Performance-Team, and 6 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3224448 (10Krinkle) [20:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170802T2000). Please do the needful. 
[20:00:11] no parsoid deploy [20:00:13] Nothing for ORES today [20:00:28] 10Operations, 10ops-eqiad: Decommission WMF3248 (old R510) - https://phabricator.wikimedia.org/T172323#3494819 (10Cmjohnson) [20:03:51] (03PS1) 10Ottomata: Update kafka server.properties to work with defaults [puppet] - 10https://gerrit.wikimedia.org/r/369724 (https://phabricator.wikimedia.org/T166162) [20:05:00] (03CR) 10jerkins-bot: [V: 04-1] Update kafka server.properties to work with defaults [puppet] - 10https://gerrit.wikimedia.org/r/369724 (https://phabricator.wikimedia.org/T166162) (owner: 10Ottomata) [20:06:13] (03PS2) 10Ottomata: Update kafka server.properties to work with defaults [puppet] - 10https://gerrit.wikimedia.org/r/369724 (https://phabricator.wikimedia.org/T166162) [20:09:40] (03CR) 10Ottomata: [C: 032] "No-op (only comment changes in server.properties) https://puppet-compiler.wmflabs.org/compiler02/7284/" [puppet] - 10https://gerrit.wikimedia.org/r/369724 (https://phabricator.wikimedia.org/T166162) (owner: 10Ottomata) [20:12:12] (03PS1) 10Cmjohnson: Adding mgmt and production dns for kafka-jumbo100[1=6] T167992 [dns] - 10https://gerrit.wikimedia.org/r/369725 [20:13:17] woohoo [20:19:04] !log bsitzmann@tin Started deploy [mobileapps/deploy@3b61ced]: Update mobileapps to 2d8e8f6 (T170325) [20:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:16] T170325: Bring back pronunciation tests - https://phabricator.wikimedia.org/T170325 [20:22:41] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 15 others: Purge Varnish cache when a banner is saved - https://phabricator.wikimedia.org/T154954#3494940 (10DStrine) [20:24:30] (03CR) 10Dzahn: [C: 031] Icinga: add check_bfd check (part 1) [puppet] - 10https://gerrit.wikimedia.org/r/369710 (https://phabricator.wikimedia.org/T83992) (owner: 10Ayounsi) [20:24:42] !log bsitzmann@tin Finished deploy [mobileapps/deploy@3b61ced]: Update mobileapps to 2d8e8f6 
(T170325) (duration: 05m 38s) [20:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:53] T170325: Bring back pronunciation tests - https://phabricator.wikimedia.org/T170325 [20:24:53] (03PS1) 10Ottomata: Manage /srv/kafka to be nice in profile, don't render zk settings if undef [puppet] - 10https://gerrit.wikimedia.org/r/369727 (https://phabricator.wikimedia.org/T166162) [20:27:55] (03CR) 10EBernhardson: CirrusSearch configuration for LTR AB test (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369594 (https://phabricator.wikimedia.org/T171212) (owner: 10EBernhardson) [20:28:08] (03PS4) 10EBernhardson: CirrusSearch configuration for LTR AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369594 (https://phabricator.wikimedia.org/T171212) [20:28:41] (03CR) 10Ottomata: [C: 032] Manage /srv/kafka to be nice in profile, don't render zk settings if undef [puppet] - 10https://gerrit.wikimedia.org/r/369727 (https://phabricator.wikimedia.org/T166162) (owner: 10Ottomata) [20:30:08] (03PS1) 10Ottomata: Specify librkafka version in Stretch (so kafkacat installs properly) [puppet] - 10https://gerrit.wikimedia.org/r/369728 (https://phabricator.wikimedia.org/T166162) [20:31:18] (03CR) 10Ottomata: [C: 032] Specify librkafka version in Stretch (so kafkacat installs properly) [puppet] - 10https://gerrit.wikimedia.org/r/369728 (https://phabricator.wikimedia.org/T166162) (owner: 10Ottomata) [20:40:23] (03PS3) 10Dzahn: add IPv6 for labtestservices* [dns] - 10https://gerrit.wikimedia.org/r/365882 [20:41:04] (03CR) 10Dzahn: "amended it so it does ONLY labtestservices, not also labweb.. 
just touching "test" for now" [dns] - 10https://gerrit.wikimedia.org/r/365882 (owner: 10Dzahn) [20:42:35] (03CR) 10Dzahn: "also fixes reverse DNS for existing labtestweb2001" [dns] - 10https://gerrit.wikimedia.org/r/365882 (owner: 10Dzahn) [20:42:48] (03PS4) 10Dzahn: add IPv6 for labtestservices* [dns] - 10https://gerrit.wikimedia.org/r/365882 [20:43:24] (03CR) 10Dzahn: [C: 032] add IPv6 for labtestservices* [dns] - 10https://gerrit.wikimedia.org/r/365882 (owner: 10Dzahn) [20:47:09] (03CR) 10Dzahn: "[radon:~] $ host labtestservices2001.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/365882 (owner: 10Dzahn) [20:47:23] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [20:49:52] fixing it [20:49:52] (03PS1) 10Dzahn: fix reverse IPv6 for labtestweb2001 [dns] - 10https://gerrit.wikimedia.org/r/369731 [20:51:13] (03CR) 10Dzahn: [C: 032] fix reverse IPv6 for labtestweb2001 [dns] - 10https://gerrit.wikimedia.org/r/369731 (owner: 10Dzahn) [20:58:55] (03PS3) 10Ayounsi: Icinga: add check_bfd check (part 1) [puppet] - 10https://gerrit.wikimedia.org/r/369710 (https://phabricator.wikimedia.org/T83992) [21:00:21] (03CR) 10Ayounsi: [C: 032] Icinga: add check_bfd check (part 1) [puppet] - 10https://gerrit.wikimedia.org/r/369710 (https://phabricator.wikimedia.org/T83992) (owner: 10Ayounsi) [21:00:25] (03PS1) 10Dzahn: Revert "add IPv6 for labtestservices*" [dns] - 10https://gerrit.wikimedia.org/r/369732 [21:01:21] (03PS1) 10Ottomata: Make sure librdkafka1 is installed before kafkacat [puppet] - 10https://gerrit.wikimedia.org/r/369733 (https://phabricator.wikimedia.org/T166162) [21:02:08] (03PS2) 10Dzahn: Revert "add IPv6 for labtestservices*" [dns] - 10https://gerrit.wikimedia.org/r/369732 [21:02:19] (03PS3) 10Dzahn: Revert "add IPv6 for labtestservices*" [dns] - 10https://gerrit.wikimedia.org/r/369732 [21:02:34] 
(03CR) 10Ottomata: [C: 032] Make sure librdkafka1 is installed before kafkacat [puppet] - 10https://gerrit.wikimedia.org/r/369733 (https://phabricator.wikimedia.org/T166162) (owner: 10Ottomata) [21:02:41] (03PS2) 10Ottomata: Make sure librdkafka1 is installed before kafkacat [puppet] - 10https://gerrit.wikimedia.org/r/369733 (https://phabricator.wikimedia.org/T166162) [21:02:43] (03CR) 10Ottomata: [V: 032 C: 032] Make sure librdkafka1 is installed before kafkacat [puppet] - 10https://gerrit.wikimedia.org/r/369733 (https://phabricator.wikimedia.org/T166162) (owner: 10Ottomata) [21:03:04] (03CR) 10Dzahn: [C: 032] Revert "add IPv6 for labtestservices*" [dns] - 10https://gerrit.wikimedia.org/r/369732 (owner: 10Dzahn) [21:09:39] (03PS1) 10Ottomata: Circumvent kafkcat circular dependnecy on librdkafka1 [puppet] - 10https://gerrit.wikimedia.org/r/369745 (https://phabricator.wikimedia.org/T166162) [21:10:40] (03PS2) 10Ottomata: Circumvent kafkcat circular dependnecy on librdkafka1 [puppet] - 10https://gerrit.wikimedia.org/r/369745 (https://phabricator.wikimedia.org/T166162) [21:12:17] (03PS1) 10Ayounsi: Revert "Icinga: add check_bfd check (part 1)" [puppet] - 10https://gerrit.wikimedia.org/r/369746 [21:13:55] (03PS2) 10Ayounsi: Revert "Icinga: add check_bfd check (part 1)" [puppet] - 10https://gerrit.wikimedia.org/r/369746 [21:13:56] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[python-snimpy] [21:14:07] (03PS3) 10Ayounsi: Revert "Icinga: add check_bfd check (part 1)" [puppet] - 10https://gerrit.wikimedia.org/r/369746 [21:14:37] (03CR) 10Ottomata: [C: 032] Circumvent kafkcat circular dependnecy on librdkafka1 [puppet] - 10https://gerrit.wikimedia.org/r/369745 (https://phabricator.wikimedia.org/T166162) (owner: 10Ottomata) [21:18:10] (03CR) 10Ayounsi: [C: 032] Revert "Icinga: add check_bfd check (part 1)" [puppet] - 10https://gerrit.wikimedia.org/r/369746 (owner: 10Ayounsi) [21:18:21] (03PS4) 10Ayounsi: Revert "Icinga: add check_bfd check (part 1)" [puppet] - 10https://gerrit.wikimedia.org/r/369746 [21:18:53] (03CR) 10Ayounsi: [V: 032 C: 032] Revert "Icinga: add check_bfd check (part 1)" [puppet] - 10https://gerrit.wikimedia.org/r/369746 (owner: 10Ayounsi) [21:22:48] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3495117 (10Ottomata) FYI: These nodes should be installed with Debian Stretch. [21:23:14] 10Operations, 10monitoring: broken dependencies for python-snimpy on jessie - https://phabricator.wikimedia.org/T172330#3495119 (10ayounsi) [21:31:21] hi twentyafterfour [21:31:41] aude: hi! :) [21:31:56] I cc'd you on T172320 [21:31:56] think i know what might be the problem [21:31:56] T172320: Error in Wikibase/client/includes/Changes/InjectRCRecordsJob.php line 120: Bad value for parameter $params: $params['change'] not set. - https://phabricator.wikimedia.org/T172320 [21:32:12] or not sure [21:32:28] the error I was seeing seems to be related to https://phabricator.wikimedia.org/rEWBA2987fe5bcf6223d03835670f7a6832e6a41a51ef [21:34:48] not sure if it is quick/easy fix for not [21:36:02] oh daniel commented [21:36:25] aude: if it's quick/easy I'd be happy to deploy the patch. If not, it's ok. 
It's getting late in european time zones and the wikis are rolled back to wmf.11 for now.. [21:36:31] oh I hadn't seen the comment yet either... [21:36:33] * twentyafterfour reads [21:37:44] even if i can come up with a patch for this, no one would be able to review until tomorrow morning [21:38:05] aude: it's ok there is no need to stress over it [21:38:08] ok [21:38:16] PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:38:17] thank you for looking into it though! [21:38:23] the other option is to go back to the previous branch of wikidata/wikibase [21:38:25] PROBLEM - cassandra-a SSL 10.64.0.167:7001 on restbase-dev1004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [21:38:35] PROBLEM - cassandra-a service on restbase-dev1004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [21:38:36] but not sure that's good since it hasn't been on test wikidata etc [21:38:45] PROBLEM - cassandra-a CQL 10.64.0.167:9042 on restbase-dev1004 is CRITICAL: connect to address 10.64.0.167 and port 9042: Connection refused [21:38:51] and the transaction issue would remain, which daniel tries to fix with his change [21:38:54] if we need to do that then we can worry about it tomorrow [21:38:56] ok [21:39:06] for now we remain on wmf.11 [21:39:09] k [21:39:43] thanks again! 
[21:40:02] sure
[21:42:25] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[21:53:45] RECOVERY - cassandra-a service on restbase-dev1004 is OK: OK - cassandra-a is active
[21:54:25] RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational
[21:55:35] RECOVERY - cassandra-a SSL 10.64.0.167:7001 on restbase-dev1004 is OK: SSL OK - Certificate restbase-dev1004-a valid until 2018-07-20 15:08:04 +0000 (expires in 351 days)
[21:55:45] RECOVERY - cassandra-a CQL 10.64.0.167:9042 on restbase-dev1004 is OK: TCP OK - 0.000 second response time on 10.64.0.167 port 9042
[22:15:17] (PS5) ArielGlenn: write dump output files to temporary location, move in place when done [dumps] - https://gerrit.wikimedia.org/r/368744 (https://phabricator.wikimedia.org/T169849)
[22:15:45] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:23:19] (CR) Huji: [C: 1] "Clearly this is correct as of now." [mediawiki-config] - https://gerrit.wikimedia.org/r/369720 (https://phabricator.wikimedia.org/T172305) (owner: Kaldari)
[22:24:16] (CR) Kaldari: "@Huji: That is a great question and I don't know the answer." [mediawiki-config] - https://gerrit.wikimedia.org/r/369720 (https://phabricator.wikimedia.org/T172305) (owner: Kaldari)
[22:36:42] (CR) Huji: [C: 1] "Do we have any other dblists in this project? If yes, do we have processes to ensure they are kept up to date?" [mediawiki-config] - https://gerrit.wikimedia.org/r/369720 (https://phabricator.wikimedia.org/T172305) (owner: Kaldari)
[22:38:51] (CR) Huji: [C: 1] "One more thing. Can we add a README.md file to https://github.com/wikimedia/operations-mediawiki-config/tree/master/dblists explaining wha" [mediawiki-config] - https://gerrit.wikimedia.org/r/369720 (https://phabricator.wikimedia.org/T172305) (owner: Kaldari)
[22:38:57] (CR) Reedy: "You could write a unit test... that iterates over all wikis, works out their native language, checks if it's rtl or ltr, and then checks i" [mediawiki-config] - https://gerrit.wikimedia.org/r/369720 (https://phabricator.wikimedia.org/T172305) (owner: Kaldari)
[22:39:03] (CR) MaxSem: "We can modify https://github.com/wikimedia/operations-mediawiki-config/blob/master/refresh-dblist to generate this list." [mediawiki-config] - https://gerrit.wikimedia.org/r/369720 (https://phabricator.wikimedia.org/T172305) (owner: Kaldari)
[22:39:45] (CR) Kaldari: "OK, I'll create a new ticket to implement a way of keeping this up to date." [mediawiki-config] - https://gerrit.wikimedia.org/r/369720 (https://phabricator.wikimedia.org/T172305) (owner: Kaldari)
[22:41:27] (CR) EBernhardson: Switch this repo to a deb package (1 comment) [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/352170 (https://phabricator.wikimedia.org/T158560) (owner: DCausse)
[22:42:12] (CR) MaxSem: [C: -1] "The new file needs to be registered at https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/CommonSettings.php#" [mediawiki-config] - https://gerrit.wikimedia.org/r/369720 (https://phabricator.wikimedia.org/T172305) (owner: Kaldari)
[22:51:26] (PS4) Kaldari: Adding RTL database list for project with default RTL languages [mediawiki-config] - https://gerrit.wikimedia.org/r/369720 (https://phabricator.wikimedia.org/T172305)
[22:51:39] (PS1) Reedy: phpcs for refresh-dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/369808
[23:00:04] (PS1) Rush: openstack: clientlib refactor [puppet] - https://gerrit.wikimedia.org/r/369809 (https://phabricator.wikimedia.org/T171494)
[23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170802T2300).
[23:00:04] ebernhardson, kaldari, and Amir1: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:00:15] here
[23:00:23] hi
[23:00:29] o/
[23:00:31] i can go last, since i need a scap
[23:00:44] (although they are reasonably fast these days, faster than jenkins sometimes :P)
[23:01:19] ohai 4pm
[23:05:04] anybody doing swat today?
[23:06:22] well, i guess i can
[23:06:26] PROBLEM - Disk space on labcontrol1001 is CRITICAL: DISK CRITICAL - free space: / 1661 MB (3% inode=87%)
[23:07:43] (CR) EBernhardson: [C: 2] Add copyright info for Wikidata API [mediawiki-config] - https://gerrit.wikimedia.org/r/369707 (https://phabricator.wikimedia.org/T112606) (owner: Ladsgroup)
[23:08:31] (CR) BryanDavis: "Exposing this table approved by bawolff on behalf of the WMF Security Team in T170927#3494567" [puppet] - https://gerrit.wikimedia.org/r/365969 (https://phabricator.wikimedia.org/T170927) (owner: Lucas Werkmeister (WMDE))
[23:08:34] (CR) EBernhardson: [C: 2] Adding RTL database list for project with default RTL languages [mediawiki-config] - https://gerrit.wikimedia.org/r/369720 (https://phabricator.wikimedia.org/T172305) (owner: Kaldari)
[23:09:09] (Merged) jenkins-bot: Add copyright info for Wikidata API [mediawiki-config] - https://gerrit.wikimedia.org/r/369707 (https://phabricator.wikimedia.org/T112606) (owner: Ladsgroup)
[23:09:19] (CR) jenkins-bot: Add copyright info for Wikidata API [mediawiki-config] - https://gerrit.wikimedia.org/r/369707 (https://phabricator.wikimedia.org/T112606) (owner: Ladsgroup)
[23:09:23] Amir1: yours is about to go out
[23:09:33] Thanks!
[23:09:58] (Merged) jenkins-bot: Adding RTL database list for project with default RTL languages [mediawiki-config] - https://gerrit.wikimedia.org/r/369720 (https://phabricator.wikimedia.org/T172305) (owner: Kaldari)
[23:10:44] Amir1: code is up on mwdebug1001
[23:12:00] (CR) jenkins-bot: Adding RTL database list for project with default RTL languages [mediawiki-config] - https://gerrit.wikimedia.org/r/369720 (https://phabricator.wikimedia.org/T172305) (owner: Kaldari)
[23:12:01] ebernhardson: okay, it does have strange effects, please revert :(
[23:12:26] Amir1: ok
[23:12:41] (PS1) EBernhardson: Revert "Add copyright info for Wikidata API" [mediawiki-config] - https://gerrit.wikimedia.org/r/369811
[23:12:52] (CR) EBernhardson: [C: 2] Revert "Add copyright info for Wikidata API" [mediawiki-config] - https://gerrit.wikimedia.org/r/369811 (owner: EBernhardson)
[23:13:00] Thanks and sorry
[23:13:07] no worries, thats why we check :)
[23:14:23] (Merged) jenkins-bot: Revert "Add copyright info for Wikidata API" [mediawiki-config] - https://gerrit.wikimedia.org/r/369811 (owner: EBernhardson)
[23:14:35] (CR) jenkins-bot: Revert "Add copyright info for Wikidata API" [mediawiki-config] - https://gerrit.wikimedia.org/r/369811 (owner: EBernhardson)
[23:15:24] Amir1: I notice for the text... in initialise settings it's very short for other wikis
[23:15:31] kaldari: your up on mwdebug1001 now
[23:15:56] kaldari: i'm not sure how much can be checked directly.. anything?
[23:16:18] Reedy: yeah, but licensing in wikidata is complex
[23:16:28] some namespaces are PD, some CC-BY-SA
[23:16:29] ebernhardson: nope, no way to test that one
[23:16:35] RECOVERY - Disk space on labcontrol1001 is OK: DISK OK
[23:17:01] It wouldn't be bad to have namespace-based copyright footer
[23:17:22] ebernhardson: I'll just make sure nothing's on fire :)
[23:17:32] !log ebernhardson@tin Started scap: dblists/rtl.dblist T172305: Adding RTL database list for project with default RTL languages
[23:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:17:46] T172305: Create an RTL database list - https://phabricator.wikimedia.org/T172305
[23:18:13] ebernhardson: Everything looks good to me
[23:18:21] as far as I could tell :)
[23:18:38] I need to go, see you tomorrow
[23:18:39] o/
[23:24:58] (CR) Jforrester: phpcs for refresh-dblist (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/369808 (owner: Reedy)
[23:26:06] (CR) Reedy: phpcs for refresh-dblist (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/369808 (owner: Reedy)
[23:26:24] (PS2) Reedy: phpcs for refresh-dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/369808
[23:37:33] (PS1) Reedy: Remove wikimania2015wiki from wmgCentralAuthLoginIcon [mediawiki-config] - https://gerrit.wikimedia.org/r/369814
[23:52:55] PROBLEM - puppet last run on db1089 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:53:53] !log ebernhardson@tin Finished scap: dblists/rtl.dblist T172305: Adding RTL database list for project with default RTL languages (duration: 36m 21s)
[23:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:54:04] T172305: Create an RTL database list - https://phabricator.wikimedia.org/T172305