[00:14:39] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.599 second response time [00:17:20] PROBLEM - Nginx local proxy to apache on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:17:29] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.151 second response time [00:17:39] PROBLEM - HHVM rendering on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:18:19] RECOVERY - Nginx local proxy to apache on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 8.574 second response time [00:18:29] RECOVERY - HHVM rendering on mw1197 is OK: HTTP OK: HTTP/1.1 200 OK - 73062 bytes in 1.422 second response time [00:43:09] PROBLEM - puppet last run on mw1210 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:52:39] PROBLEM - HHVM rendering on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:54:03] Operations, MediaWiki-JobQueue, Wikidata: Job queue rising to nearly 3 million jobs - https://phabricator.wikimedia.org/T159618#3073585 (Legoktm) 00:47, 5 March 2017 Sjoerddebruin (talk | contribs) blocked Emijrpbot (talk | contribs) with an expiration time of indefinite (account creation disabled, a... [00:54:29] RECOVERY - HHVM rendering on mw1197 is OK: HTTP OK: HTTP/1.1 200 OK - 73034 bytes in 1.759 second response time [01:01:39] PROBLEM - Check systemd state on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:02:29] RECOVERY - Check systemd state on cobalt is OK: OK - running: The system is fully operational [01:03:29] PROBLEM - Check whether ferm is active by checking the default input chain on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:04:19] RECOVERY - Check whether ferm is active by checking the default input chain on cobalt is OK: OK ferm input default policy is set [01:06:19] PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [01:06:39] PROBLEM - etcdmirror--eqiad-wmnet service on conf2002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror--eqiad-wmnet is failed [01:07:09] PROBLEM - dhclient process on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:07:09] PROBLEM - configured eth on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:07:10] PROBLEM - DPKG on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:07:10] PROBLEM - Etcd replication lag on conf2002 is CRITICAL: connect to address 10.192.32.141 and port 8000: Connection refused [01:07:19] PROBLEM - puppet last run on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:07:29] PROBLEM - Check whether ferm is active by checking the default input chain on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:08:09] PROBLEM - gerrit process on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:08:19] PROBLEM - salt-minion processes on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:08:39] PROBLEM - MD RAID on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:08:40] PROBLEM - Disk space on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[01:08:59] RECOVERY - dhclient process on cobalt is OK: PROCS OK: 0 processes with command name dhclient [01:08:59] RECOVERY - gerrit process on cobalt is OK: PROCS OK: 1 process with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [01:08:59] RECOVERY - DPKG on cobalt is OK: All packages OK [01:08:59] RECOVERY - configured eth on cobalt is OK: OK - interfaces up [01:09:09] RECOVERY - salt-minion processes on cobalt is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [01:09:09] RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 11 minutes ago with 0 failures [01:09:19] RECOVERY - Check whether ferm is active by checking the default input chain on cobalt is OK: OK ferm input default policy is set [01:09:49] PROBLEM - Check size of conntrack table on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:10:09] RECOVERY - puppet last run on mw1210 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [01:10:29] RECOVERY - MD RAID on cobalt is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [01:10:29] RECOVERY - Disk space on cobalt is OK: DISK OK [01:10:39] RECOVERY - Check size of conntrack table on cobalt is OK: OK: nf_conntrack is 0 % full [01:11:09] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [01:12:09] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1002 is OK: OK ferm input default policy is set [01:13:09] PROBLEM - puppet last run on es1013 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [01:41:09] RECOVERY - puppet last run on es1013 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [01:43:19] PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100% [01:44:19] RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.27 ms [02:06:45] Operations, MediaWiki-JobQueue, Wikidata: Job queue rising to nearly 3 million jobs - https://phabricator.wikimedia.org/T159618#3073330 (Betacommand) Suggestion, when the job queue gets too high set the maxlag parameter to a higher value, most bots use that as a throttle. [02:12:09] PROBLEM - puppet last run on elastic1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:18:42] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.14) (duration: 07m 07s) [02:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:24:02] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun Mar 5 02:24:02 UTC 2017 (duration 5m 20s) [02:24:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:35:51] (PS1) Madhuvishy: paws-internal: Add support for serving static files authenticated via ldap [puppet] - https://gerrit.wikimedia.org/r/341188 [02:37:15] (CR) jerkins-bot: [V: -1] paws-internal: Add support for serving static files authenticated via ldap [puppet] - https://gerrit.wikimedia.org/r/341188 (owner: Madhuvishy) [02:39:09] RECOVERY - puppet last run on elastic1036 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [02:40:06] (PS2) Madhuvishy: paws-internal: Add support for serving static files authenticated via ldap [puppet] - https://gerrit.wikimedia.org/r/341188 [02:45:41] (PS3) Madhuvishy: paws-internal: Add support for serving static files authenticated via ldap [puppet] - https://gerrit.wikimedia.org/r/341188 [02:49:31] (CR) Madhuvishy: 
[C: 2] paws-internal: Add support for serving static files authenticated via ldap [puppet] - https://gerrit.wikimedia.org/r/341188 (owner: Madhuvishy) [02:53:59] PROBLEM - Check systemd state on notebook1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [02:54:29] PROBLEM - puppet last run on notebook1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[apache2] [03:02:19] PROBLEM - puppet last run on notebook1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[apache2] [03:02:59] PROBLEM - Check systemd state on notebook1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [03:09:53] (PS1) Madhuvishy: paws-internal: Add apache ldap auth puppet dependencies [puppet] - https://gerrit.wikimedia.org/r/341189 [03:11:15] (CR) Madhuvishy: [C: 2] paws-internal: Add apache ldap auth puppet dependencies [puppet] - https://gerrit.wikimedia.org/r/341189 (owner: Madhuvishy) [03:13:29] RECOVERY - puppet last run on notebook1001 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [03:13:59] RECOVERY - Check systemd state on notebook1001 is OK: OK - running: The system is fully operational [03:13:59] RECOVERY - Check systemd state on notebook1002 is OK: OK - running: The system is fully operational [03:14:19] RECOVERY - puppet last run on notebook1002 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [03:22:39] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 669.91 seconds [03:22:39] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [03:28:39] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 64.95 seconds [03:33:19] PROBLEM - puppet last run on analytics1039 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz] [03:33:29] PROBLEM - puppet last run on eeden is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz] [03:51:40] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [04:00:29] RECOVERY - puppet last run on eeden is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [04:01:19] RECOVERY - puppet last run on analytics1039 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [04:46:09] PROBLEM - puppet last run on elastic1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:02:19] PROBLEM - puppet last run on mw1216 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:14:09] RECOVERY - puppet last run on elastic1018 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [05:24:09] PROBLEM - puppet last run on mw1297 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:30:19] RECOVERY - puppet last run on mw1216 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [05:34:19] PROBLEM - puppet last run on restbase1018 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:53:09] RECOVERY - puppet last run on mw1297 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [05:55:29] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:02:19] RECOVERY - puppet last run on restbase1018 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:02:47] Operations, MediaWiki-JobQueue, Wikidata: Job queue rising to nearly 3 million jobs - https://phabricator.wikimedia.org/T159618#3073686 (Legoktm) p:Unbreak!>High Going down slowly... [06:08:10] PROBLEM - puppet last run on labvirt1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:23:29] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [06:32:09] PROBLEM - puppet last run on mw1192 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:36:09] RECOVERY - puppet last run on labvirt1006 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [06:54:39] PROBLEM - puppet last run on maerlant is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:01:09] RECOVERY - puppet last run on mw1192 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [07:11:09] PROBLEM - puppet last run on es1017 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [07:23:30] RECOVERY - puppet last run on maerlant is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [07:39:09] RECOVERY - puppet last run on es1017 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [07:41:17] (PS1) Yuvipanda: tools: Allow readonly access to all namespace objects [puppet] - https://gerrit.wikimedia.org/r/341191 [07:41:43] (PS2) Yuvipanda: tools: Allow readonly access to all namespace objects [puppet] - https://gerrit.wikimedia.org/r/341191 [07:53:33] Operations, MediaWiki-JobQueue, Wikidata: Job queue rising to nearly 3 million jobs - https://phabricator.wikimedia.org/T159618#3073330 (Emijrp) All my bots follow the maxlag policy, as defined by default in Pywikibot user-config.py. [08:07:39] PROBLEM - puppet last run on logstash1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:36:25] (PS3) Yuvipanda: tools: Allow readonly access to all namespace objects [puppet] - https://gerrit.wikimedia.org/r/341191 [08:36:39] RECOVERY - puppet last run on logstash1006 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [09:50:31] (PS1) MarcoAurelio: Add 'flow-create-board' to CommonSettings.php for global groups [mediawiki-config] - https://gerrit.wikimedia.org/r/341193 [09:54:29] PROBLEM - Hadoop DataNode on analytics1028 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [09:57:09] PROBLEM - puppet last run on analytics1028 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[hadoop-hdfs-datanode] [10:10:49] this one is a disk issue --^ [10:17:11] Operations, ops-eqiad, Analytics, Analytics-Cluster, DC-Ops: Analytics1028 hdfs daemon died because of disk errors - https://phabricator.wikimedia.org/T159632#3073705 (elukey) [10:19:06] !log disabled puppet on analytics1028 to avoid puppet to start the HDFS daemon (T159632) [10:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:12] T159632: Analytics1028 hdfs daemon died because of disk errors - https://phabricator.wikimedia.org/T159632 [10:20:23] this host is a bit important since it is one of the three Hadoop HDFS journal nodes, but the HDFS daemon seems the only one impacted [10:23:36] so I stopped Yarn nodemanager too, scheduled downtime and left the journalnode daemon up and running (since it seems working fine) [10:24:09] not sure if the disk will be swapped soon next week, so tomorrow I'll move the journalnode to analytics1029 probably [10:24:14] but for the moment everything seems fine [10:24:28] * elukey sending an email to analytics [10:25:10] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [10:25:11] Operations, ops-eqiad, Analytics, Analytics-Cluster, DC-Ops: Analytics1028 hdfs daemon died because of disk errors - https://phabricator.wikimedia.org/T159632#3073732 (elukey) I also stopped the Yarn node manager but not the journalnode, will probably move it to analytics1029 tomorrow.
[10:30:09] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [10:34:47] all right done, everything seems good [10:35:09] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [10:40:09] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [10:45:09] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [10:50:09] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [10:55:09] RECOVERY - check_puppetrun on lutetium is OK: OK: Puppet is currently enabled, last run 268 seconds ago with 0 failures [11:20:22] (PS1) Mbch331: WIP: Remove exception on Other Projects sidebar for Dutch Wikipedia Bug: T159634 [mediawiki-config] - https://gerrit.wikimedia.org/r/341195 (https://phabricator.wikimedia.org/T159634) [12:03:39] (PS2) Urbanecm: WIP: Remove exception on Other Projects sidebar for Dutch Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/341195 (https://phabricator.wikimedia.org/T159634) (owner: Mbch331) [12:05:29] (CR) Urbanecm: [C: -1] "Without consensus, clarifying in the task." [mediawiki-config] - https://gerrit.wikimedia.org/r/341195 (https://phabricator.wikimedia.org/T159634) (owner: Mbch331) [12:26:55] (PS3) Mbch331: WIP: Remove exception on Other Projects sidebar for Dutch Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/341195 (https://phabricator.wikimedia.org/T159634) [12:29:19] PROBLEM - puppet last run on mc1022 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:30:09] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [12:35:09] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [12:40:09] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [12:45:09] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [12:50:09] RECOVERY - check_puppetrun on lutetium is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [12:57:19] RECOVERY - puppet last run on mc1022 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [13:02:19] PROBLEM - puppet last run on analytics1039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:12:09] PROBLEM - puppet last run on restbase1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:16:08] Operations, Wikimedia-General-or-Unknown: GenerateFancyCaptchas cronjob should output to logfile - https://phabricator.wikimedia.org/T159610#3073831 (Florian) a:Florian [13:19:40] Operations, Puppet: GenerateFancyCaptchas cronjob should output to logfile - https://phabricator.wikimedia.org/T159610#3073832 (Florian) [13:31:19] RECOVERY - puppet last run on analytics1039 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [13:32:52] (PS1) Florianschmidtwelzow: Save logs of generate CAPTCHA cron to /var/log/mediawiki [puppet] - https://gerrit.wikimedia.org/r/341197 (https://phabricator.wikimedia.org/T159610) [13:41:09] RECOVERY - puppet last run on restbase1017 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [13:57:39] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:22:40] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:23:29] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:26:39] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [14:49:19] RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational [14:52:21] PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:07:39] PROBLEM - puppet last run on prometheus1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:36:39] RECOVERY - puppet last run on prometheus1002 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [15:48:09] PROBLEM - puppet last run on mw1298 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:13:09] PROBLEM - puppet last run on es1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:17:09] RECOVERY - puppet last run on mw1298 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [16:40:09] RECOVERY - puppet last run on es1013 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [17:23:09] PROBLEM - puppet last run on wdqs1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:34:36] Operations, MediaWiki-JobQueue, Wikidata: Job queue rising to nearly 3 million jobs - https://phabricator.wikimedia.org/T159618#3073982 (Legoktm) >>! In T159618#3073691, @Emijrp wrote: > All my bots follow the maxlag policy, as defined by default in Pywikibot user-config.py. Can you add a ratelimit...
[17:45:59] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed [17:46:39] PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:47:29] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 648 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4223399 keys, up 125 days 9 hours - replication_delay is 648 [17:47:29] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 650 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4223445 keys, up 125 days 9 hours - replication_delay is 650 [17:51:09] RECOVERY - puppet last run on wdqs1001 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [18:00:09] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [18:05:09] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [18:09:39] RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational [18:09:59] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active [18:10:09] RECOVERY - check_puppetrun on lutetium is OK: OK: Puppet is currently enabled, last run 213 seconds ago with 0 failures [18:10:29] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:15:09] PROBLEM - puppet last run on db1044 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:38:29] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [18:43:09] RECOVERY - puppet last run on db1044 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [18:47:03] Dereckson: are you around ? [18:55:33] anyone here that wants/can do server side upload ? [19:11:23] hi matanya, how can I help you? [19:24:35] Dereckson: hi, i'd like to do a server side upload, would a ticket be the right place ? [19:26:25] Generally speaking a ticket is the right place afaik [19:27:43] I'd note that in theory the size limit for server side upload is the same as for normal uploads nowadays [19:28:25] but you can't upload stuff automatically if there are a gazillion files [19:31:38] That is true [19:32:43] matanya: feel free to create a task and put me as subscriber, I'll handle it [19:33:12] Dereckson: https://phabricator.wikimedia.org/T159650 [19:34:22] matanya: how can I transfer from encoding01.eqiad.wmflabs to Terbium? [19:34:40] Dereckson: scp would do, i guess [19:38:29] matanya: ask zhuyifei1999_ there are some tricks to share it on the web [19:40:36] Dereckson: will it help ? i think you should have the right to pull them, and if not, i can handle the sharing, i guess [19:40:44] matanya: one way would be to reuse the urls that serve v2c files [19:41:03] zhuyifei1999_: it is in /srv/matanya on encoding01 [19:41:18] what should be done to re-use it ? [19:41:59] copy to /srv/v2c/ssu?
[19:42:21] or you can get nginx to point to /srv/matanya [19:42:30] add a proxy [19:43:19] * zhuyifei1999_ gtg [19:43:28] thanks zhuyifei1999_ [19:43:37] np [19:44:22] * zhuyifei1999_ gtg [19:48:54] Dereckson: shared [19:58:24] (PS3) Urbanecm: Bs.wiktionary namespace changes [mediawiki-config] - https://gerrit.wikimedia.org/r/341035 (https://phabricator.wikimedia.org/T159538) [20:30:27] (PS1) Brian Wolff: Extend the upload Content-Security-Policy test to other large wikis [puppet] - https://gerrit.wikimedia.org/r/341207 (https://phabricator.wikimedia.org/T117618) [20:39:05] Dereckson: i updated the ticket with the info on the web [20:40:43] ok [20:47:29] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4195505 keys, up 125 days 12 hours - replication_delay is 2 [20:49:19] RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational [20:52:19] PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:57:29] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 602 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4195505 keys, up 125 days 12 hours - replication_delay is 602 [20:59:09] PROBLEM - puppet last run on wtp1005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:27:09] RECOVERY - puppet last run on wtp1005 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [22:17:51] (PS4) Paladox: Gerrit: Add some apache rewrite rules for polygerrit [puppet] - https://gerrit.wikimedia.org/r/340900 (https://phabricator.wikimedia.org/T156120) [22:17:59] (PS1) Brian Wolff: Add a CSP policy to foundationwiki to prevent privacy breach [mediawiki-config] - https://gerrit.wikimedia.org/r/341259 (https://phabricator.wikimedia.org/T159386) [22:18:01] (PS5) Paladox: Gerrit: Add some apache rewrite rules for polygerrit [puppet] - https://gerrit.wikimedia.org/r/340900 (https://phabricator.wikimedia.org/T156120) [22:18:09] matanya: ping? [22:18:38] yes Dereckson ? [22:18:44] matanya: https://commons.wikimedia.org/wiki/File:MAZEN_DAOUD.jpg [22:19:09] apparently https://tools.wmflabs.org/video2commons/static/ssu/ABBAS_ASSI.jpg.txt is 404 [22:19:12] i am fixing it [22:19:52] What syntax did you use for the description filenames? [22:20:35] what do you mean ? [22:20:55] (PS6) Paladox: Gerrit: Add some apache rewrite rules for polygerrit [puppet] - https://gerrit.wikimedia.org/r/340900 (https://phabricator.wikimedia.org/T156120) [22:21:15] The upload script expects the original filename + a suffix e.g. ABBAS_ASSI.jpg.txt for ABBAS_ASSI.jpg [22:21:24] true [22:21:50] So I followed your "Description files are available too: append .txt to the images." [22:22:05] ah, i have them as separated files, sorry [22:22:25] they are such as ABBAS_ASSI.txt [22:22:32] ok [22:22:42] i can change that if you wish [22:22:57] Yes, rename them, I'll resume the upload, with the correct description files. [22:23:08] But for the already uploaded files, you'll need to fix that manually. [22:23:15] ok [22:23:28] The script can't help: it only publishes a revision for new uploads, not for reuploads.
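The naming mismatch discussed above (description files stored as ABBAS_ASSI.txt while the upload script looks for ABBAS_ASSI.jpg.txt) can be sketched as a small rename plan. This is only an illustration under stated assumptions: the helper names `description_name` and `plan_renames` are hypothetical and are not part of the upload script, and the sketch only computes the old-to-new name mapping without touching any files.

```python
from pathlib import PurePosixPath


def description_name(image_name: str) -> str:
    """Name the description file as the full image filename plus '.txt'."""
    return image_name + ".txt"


def plan_renames(image_names, description_names):
    """Map description files named STEM.txt to the expected IMAGE.txt name.

    Hypothetical helper: description files that already follow the
    convention, or that match no known image, are left out of the plan.
    """
    # Index images by their stem, e.g. "ABBAS_ASSI" -> "ABBAS_ASSI.jpg".
    by_stem = {PurePosixPath(name).stem: name for name in image_names}
    renames = {}
    for desc in description_names:
        stem = PurePosixPath(desc).stem  # "ABBAS_ASSI.txt" -> "ABBAS_ASSI"
        image = by_stem.get(stem)
        if image is not None and desc != description_name(image):
            renames[desc] = description_name(image)
    return renames
```

For example, given the images ABBAS_ASSI.jpg and MAZEN_DAOUD.jpg, a description file called ABBAS_ASSI.txt would be planned for renaming to ABBAS_ASSI.jpg.txt, while MAZEN_DAOUD.jpg.txt would be left alone.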
[22:23:41] !log Generating some more captchas again T159581 [22:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:47] T159581: The same CAPTCHA image is always used across platforms and refresh - https://phabricator.wikimedia.org/T159581 [22:32:25] Operations, Gerrit: Decide how to support polygerrit - https://phabricator.wikimedia.org/T158479#3074214 (Paladox) I've managed to fix it upstream at https://gerrit-review.googlesource.com/#/c/99004/ :) [22:32:39] PROBLEM - puppet last run on wtp1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:35:09] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [22:40:09] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [22:40:53] ^^ is that intentional? [22:41:29] jouncebot: refresh [22:41:36] I refreshed my knowledge about deployments. [22:44:52] Dereckson: if i delete the pages, will it re-upload ? [22:45:09] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [22:45:37] matanya: er yes [22:45:57] so i prefer this way, if you don't mind [22:46:32] * Dereckson nods [22:46:44] will be faster indeed [22:47:41] doing shortly [22:50:09] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [22:51:53] I've added a -f to my curl alias, so it won't download 404 pages next time.
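curl's -f (--fail) option makes curl exit with an error instead of saving the server's error page when the HTTP status is 400 or above, which is what stops a 404 page from being stored as a description file. A rough Python analogue is sketched below; the function name `fetch_description` and the injectable `opener` parameter are assumptions made for illustration, not anything used in the actual upload workflow.

```python
import urllib.error
import urllib.request


def fetch_description(url, opener=urllib.request.urlopen):
    """Return the response body, or None when the server answers with an HTTP error.

    urllib raises HTTPError for statuses such as 404, so an error page is
    never returned as if it were the description text.
    """
    try:
        with opener(url) as response:
            return response.read()
    except urllib.error.HTTPError:
        # Comparable to curl -f: treat the error status as a failure
        # rather than keeping the error page.
        return None
```

A caller can then skip or flag any file whose description comes back as None instead of uploading it with a 404 page as its description.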
[22:55:09] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [22:57:16] thanks [22:59:03] matanya: I think I've successfully deleted them (with Special:Nuke) [22:59:15] yes, i can confirm that [22:59:48] please wait while the correct desc files are being re-created [22:59:52] okay, so I'm ready to reupload as soon as you've renamed *.txt to *.jpg.txt [22:59:58] a matter of a minute or two [23:00:00] * Dereckson nods [23:00:09] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [23:00:39] RECOVERY - puppet last run on wtp1022 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [23:05:09] RECOVERY - check_puppetrun on db1025 is OK: OK: Puppet is currently enabled, last run 297 seconds ago with 0 failures [23:09:37] Operations, MediaWiki-JobQueue, Wikidata: Job queue rising to nearly 3 million jobs - https://phabricator.wikimedia.org/T159618#3074250 (Emijrp) Done. I added a put_throttle = 3 seconds. Anyway I think I will wait for the job queue to go down. [23:11:58] (PS1) Brian Wolff: Change account creation throttle for idwiki to default (6) [mediawiki-config] - https://gerrit.wikimedia.org/r/341263 [23:18:49] Dereckson: done [23:18:58] took longer than i thought it would [23:21:48] k [23:22:51] Dereckson: i see another issue [23:22:58] did you start ? [23:23:19] Not yet, you can still fix it. [23:26:39] ok Dereckson it is a go [23:26:54] i hope i spotted them all [23:29:05] ok [23:32:43] https://commons.wikimedia.org/wiki/File:YEHEZKEL_HAREL.jpg #better [23:34:36] matanya: strange, after https://phabricator.wikimedia.org/T159650#3074285 it stopped oO [23:35:25] Dereckson: any error message ? [23:35:58] nope [23:36:41] maybe just try again ? [23:36:47] I've rm the uploaded and launched again, yes [23:42:51] matanya: I confirm my log is full blue, so all deleted files have been reuploaded [23:43:07] thank you so much!
[23:43:33] for your help, your patience and kindness :) [23:43:39] You're welcome.